WO2012045492A1 - Content retrieval system - Google Patents

Content retrieval system

Info

Publication number
WO2012045492A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
word
speech
documents
grammar
Application number
PCT/EP2011/061691
Other languages
French (fr)
Inventor
Joseph Allemendou
Noel Fitzpatrick
John Kelleher
Brian Macnamee
Eamonn Newman
Louise Veling
Original Assignee
Dublin Institute Of Technology
National Digital Research Centre Limited
Application filed by Dublin Institute Of Technology, National Digital Research Centre Limited
Publication of WO2012045492A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique

Abstract

A content retrieval system includes a grammatical analyzer cooperable with a set of grammar feature functions, at least one of the functions defining a pattern comprising: a required part-of-speech for at least one word of a sentence within a document, a set of character constraints for the word and a dependency relationship between the word and another word in the sentence within a document. The grammatical analyzer is arranged to apply at least one grammar feature function to a set of documents to identify documents including text conforming to the grammar feature function definition. The system further includes a user interface component arranged to enable a user to specify at least one of the set of grammar feature functions and to provide to the user a filtered set of the documents according to the documents including text conforming to the grammar feature function.

Description

Content Retrieval System
The present invention relates to a system and method for content retrieval.
It has long been the practice for graded texts in language course books to contain numerous examples of the same grammatical structure in order to highlight the use of that structure in context. For example, a graded text might contain several examples of grammatical structure. Unless the text is skilfully written, this will often have the effect of making the style of the text rather stilted and unrealistic but the aim of highlighting the structure is nonetheless generally achieved with this type of presentation. Typically, the text is then followed by comprehension questions that again highlight the structure and then various grammar-based practice exercises to reinforce its use in some kind of context.
In the case of authentic reading texts, for example a newspaper article, however, it is rarely the case that a particular text will have numerous examples of the same grammatical structure. It is far more likely to employ a wide range of structures of varying complexity and may refer to past, present and future time and make use of both progressive and perfect aspects, as well as both active and passive voices.
If a teacher intends to use an authentic text to highlight a particular grammatical structure or structures (and authentic texts are generally much more relevant or interesting for students), the teacher has to identify a structure that is both useful and challenging for the class they are teaching and then decide how best to exploit that particular structure for further work. It is highly likely that the teacher will have to go beyond the actual text and produce an exercise related to the example or examples in the text in order to give substantial practice of the chosen structure. For example, the teacher might ask the students to find (underline, highlight) an example or examples of a particular structure in the text, as follows:
1. In paragraph 1, find an example of a passive sentence.
2. Underline all the passive sentences in the text.
3. How many different tenses can you find expressed in the passive voice in the text?
The advantage of focusing on a structure or structures in an authentic text such as a newspaper article is that it enables the students to see these structures functioning in an authentic context. The disadvantage can be that it is rarely sufficient simply to observe or to notice and, in order to give further related practice, the teacher will almost certainly need to develop an exercise in addition to the text.
It is an object of the present invention to provide a content retrieval system which overcomes the above problems and readily provides texts of relevance to both a language teacher and their students for assisting in teaching and learning a language.
According to the present invention there is provided a content retrieval system, said system including a grammatical analyzer cooperable with a set of grammar feature functions, at least one of said functions defining a pattern comprising: a required part-of-speech for at least one word of a sentence within a document, a set of character constraints for the word and a dependency relationship between said word and another word in said sentence within a document, said grammatical analyzer being arranged to apply at least one grammar feature function to a set of documents to identify documents including text conforming to said grammar feature function definition, said system further including a user interface component arranged to enable a user to specify at least one of said set of grammar feature functions and to provide to said user a filtered set of said documents according to the documents including text conforming to said grammar feature function.
Preferably, at least one of said functions comprises a plurality of patterns, at least one of said patterns comprising a required part-of-speech for at least one word of a sentence within a document, a set of character constraints for the word and a dependency relationship between said word and another word in said sentence within a document.
Preferably, said dependency relationship comprises one or both of a permitted relationship or non-permitted relationship.
Preferably, said character constraints comprise one or both of a permitted set of characters or a non-permitted set of characters.
Preferably, at least one of said functions comprises a pattern comprising a required part-of-speech for at least two words of a sentence within a document, a set of character constraints for the at least two words and a dependency relationship between said two words.
Preferably, said system further comprises a tokenizer arranged to identify individual words within a document, a part of speech analyzer arranged to provide a part-of-speech tag to each word of said document and a dependency analyzer arranged to generate grammatical relationships between words of said document.
Further preferably, said analyzers are arranged to output an analyzed document in XML format including said part-of-speech tags and said grammatical relationships.
Preferably, said system is arranged to obtain documents for analysis from the Internet and said grammatical analyzer is arranged to analyse and tag obtained documents for each defined grammar feature function and to store said analysis information in association with said document.
Preferably, said user interface component is arranged to use said stored analysis information in providing said filtered set of documents.
Preferably, said user interface component is arranged to display said filtered set ranked according to their relevance to said user specified grammar feature.
Preferably, said user interface component is arranged to display a selected document for said filtered set in an enhanced manner.
Further preferably, said enhancement comprises one or more of: highlighting text within said document conforming to said user specified grammar feature; abstracting said document prior to display; displaying a glossary for said document; or displaying a set of exercises for said document.
Preferably, said grammar feature functions define patterns for one or more of: an active form of a verb; a passive form of a verb; or a non-reflexive pronoun.
An embodiment of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 is a schematic diagram of a content retrieval system according to a preferred embodiment of the present invention;
Figures 2(a) and 2(b) illustrate the structural analysis performed by modules of the embodiment;
Figure 3 shows the components for grammar feature functions employed within the system of Figure 1;
Figures 4 to 7 show examples of sentences highlighting the parts of speech tags for individual words as well as the dependency relationships between words of the sentences; and
Figure 8 is a schematic view of a user application screen for accessing information within the content retrieval system of Figure 1.
Referring now to the drawings, there is shown schematically, a content retrieval system 10 according to an embodiment of the present invention.
The system is based on documents 12 gathered in a text format from the Internet (news articles, book passages, blogs etc.) or indeed from any other suitable source.
Each document 12 is processed by an analyzer 14 to generate an XML file 16 containing all the document and analysis information. In the embodiment, the analyses used are:
• Tokenization, where a document is split into tokens, for example, words. A suitable tokenizer is Apache Lucene (http://lucene.apache.org/java/3_0_2/index.html), which provides a document searching and indexing API for a tokeniser. The tokeniser takes the text document 12 and splits it into paragraphs, sentences and words based on a set of rules.
• Part of speech tagging, where the part of speech associated with each token of the document is generated. For instance, tokens such as those in the first line below might be provided with the respective part of speech tags listed in the second line:
Colourless   green       ideas   sleep   furiously
adjective    adjective   noun    verb    adverb
A suitable Part of Speech Tagger is the Stanford Log-linear Part-Of-Speech Tagger (http://nlp.stanford.edu/software/tagger.shtml). Also, Kristina Toutanova and Christopher D. Manning, "Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger", In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70; and Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer, "Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network", In Proceedings of HLT-NAACL 2003, pp. 252-259 describe the operation of such a tagger.
• PCFG (probabilistic context-free grammar) parsing generates the grammatical structure of a sentence within a document, such as shown in Figure 2(a).
• Dependency parsing generates the grammatical relations between the words, for example as shown in Figure 2(b).
A suitable PCFG Parser and Dependency Parser is the Stanford Parser (http://nlp.stanford.edu/software/lex-parser.shtml). This is further described in Dan Klein and Christopher D. Manning, "Fast Exact Inference with a Factored Model for Natural Language Parsing", In Advances in Neural Information Processing Systems 15 (NIPS 2002), Cambridge, MA: MIT Press, pp. 3-10, 2003; and Dan Klein and Christopher D. Manning, "Accurate Unlexicalized Parsing", Proceedings of the 41st Meeting of the Association for Computational Linguistics, pp. 423-430, 2003.
• Named-entity recognition identifies the proper nouns in a sentence. A suitable analyzer is the Stanford Named Entity Recogniser (http://nlp.stanford.edu/software/CRF-NER.shtml) and this is further described in Jenny Rose Finkel, Trond Grenager, and Christopher Manning, "Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling", Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.
Other analyses of a document can also be conducted so that, for example, documents can be analyzed for their linguistic complexity, producing a grading for the document corresponding, for example, to a CEFR level from A1 to C2.
It should be appreciated that the tags above as well as the structures of Figures 2(a) and 2(b) shown in human readable form are illustrative only and in practice the XML document 16 produced by the above analysis stages would instead include more encoded information reflecting these structures and tags and which would be more readily used in the processing described below.
In any case, it will be appreciated that using the combination of analysers described above, original documents can be extended to include meta-information indicating at least the part of speech for each word of a document, the grammatical structure of phrases and sentences within the document and the inter-dependencies of words within phrases and sentences of a document.
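As a rough illustration of such an extended document, the following Python sketch serialises one sentence together with its part-of-speech tags and dependency relations. The element and attribute names (document, sentence, token, dep) are assumptions made for this example only, since the XML format of file 16 is not spelled out here, and the tags and relations are hand-written stand-ins for the output of the analysers rather than real tagger or parser output.

```python
import xml.etree.ElementTree as ET

# Hand-written analysis results for one sentence (illustrative only). In the
# described system these would come from the tokeniser, tagger and parsers.
TOKENS = [("I", "PRP"), ("write", "VBP"), ("the", "DT"), ("document", "NN")]
# (relation type, governor index, dependent index), using 1-based token indices.
DEPENDENCIES = [("nsubj", 2, 1), ("det", 4, 3), ("dobj", 2, 4)]


def build_analysed_document(tokens, dependencies):
    """Return an XML element holding tokens, POS tags and dependency relations."""
    doc = ET.Element("document")
    sentence = ET.SubElement(doc, "sentence", id="1")
    for index, (word, pos) in enumerate(tokens, start=1):
        token = ET.SubElement(sentence, "token", id=str(index), pos=pos)
        token.text = word
    for relation, governor, dependent in dependencies:
        ET.SubElement(sentence, "dep", type=relation,
                      governor=str(governor), dependent=str(dependent))
    return doc


analysed = build_analysed_document(TOKENS, DEPENDENCIES)
xml_text = ET.tostring(analysed, encoding="unicode")
```

Keeping the words, tags and relations in one file means that a later stage can work from the XML alone without re-running any of the analysers.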
These analyses on their own however do not address the problem outlined above facing a teacher wishing to identify suitable authentic texts or portions of such texts, which would be useful for students of a language. According to the present invention, the content retrieval system includes a grammatical analyzer 18 which can be applied to the XML documents 16 to identify passages within these documents including the required grammatical feature.
The analyzer 18 outputs processed documents along with their grammatical feature information (how many of each feature, where the feature appears in the document etc) which are indexed within a database 20, in addition to basic document information (source, category etc).
The particular operation of the analyzer 18 will be described in more detail below. Once the database 20 has been prepared, however, users can search it for documents that contain the grammatical features they are interested in teaching, as well as filtering documents within the database using conventional parameters including keywords contained, date, source etc.
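A search of this kind might be sketched in Python as follows. The record type, its field names and the search function are hypothetical and are not taken from the embodiment; in particular, ordering results by the number of occurrences of the feature is only one possible reading of relevance and is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class IndexedDocument:
    """Hypothetical database record; the field names are illustrative."""
    title: str
    source: str
    published: date
    text: str
    feature_counts: dict = field(default_factory=dict)  # feature name -> occurrences


def search(documents, feature, keywords=(), source=None, since=None):
    """Return documents containing `feature`, most occurrences first."""
    hits = []
    for doc in documents:
        if doc.feature_counts.get(feature, 0) == 0:
            continue  # the requested grammar feature does not occur
        if source is not None and doc.source != source:
            continue
        if since is not None and doc.published < since:
            continue
        lowered = doc.text.lower()
        if not all(keyword.lower() in lowered for keyword in keywords):
            continue
        hits.append(doc)
    return sorted(hits, key=lambda d: d.feature_counts[feature], reverse=True)
```

The grammar feature acts as the primary filter here, with keywords, source and date narrowing the result set further.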
For each of the grammatical features that are to be available for identification in a language teaching/learning context, a grammar feature function 19 is built and stored. For example, a grammar feature function is built to identify personal pronouns, and another to identify verbs in the past perfect tense.
The grammar feature functions 19 are applied by the analyzer 18 to each analysed document 16 as input in a specific XML format, and the analyzer provides an updated document as an output, preferably in the same specific XML format, for storage in the database 20. The resulting updated document contains information regarding the occurrences in the document of the grammar feature. This information encompasses the location of the occurrences in the document, and potentially any other specific data related to the grammatical feature which might be useful.
The grammar features 19 can use information present in the analysed XML documents 16 provided by the processor 14, including a word itself, the part of speech associated to the word, the grammar structure tree, the grammatical relation between words, and the tagged named-entities.
Figure 3 shows a schematic of the hierarchy of objects providing a set of grammar features 1...j. As mentioned above, grammar features 1...j are at a level that is useful to language teachers or learners; for example, a feature might correspond to a particular tense. These comprise combinations of grammar patterns 1...k, which in turn are particular combinations of specific part-of-speech tags, dependency relations and textual patterns that define the different ways in which a grammar feature can be instantiated. Grammar patterns 1...k themselves are micro-programs which are implemented as parameterised instantiations of generic matchers 1...m. Generic matchers 1...m match generic patterns based on specific part-of-speech tags etc. provided by the individual analysers 14.
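The three-level hierarchy of Figure 3 can be illustrated with a minimal Python sketch. The class names, the `find` methods and the toy `tag_matcher` are hypothetical and are introduced here only for illustration; the part-of-speech tags are assigned by hand rather than produced by a tagger.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple

Word = Tuple[str, str]  # (text, part-of-speech tag), tags assigned by hand here


@dataclass
class GrammarPattern:
    """A generic matcher bound to one concrete set of parameters."""
    name: str
    matcher: Callable[..., List[int]]
    parameters: Dict[str, Any]

    def find(self, words: Sequence[Word]) -> List[int]:
        return self.matcher(words, **self.parameters)


@dataclass
class GrammarFeature:
    """A teacher-level feature: the union of several grammar patterns."""
    name: str
    patterns: Sequence[GrammarPattern]

    def find(self, words: Sequence[Word]) -> Dict[str, List[int]]:
        return {p.name: p.find(words) for p in self.patterns}


def tag_matcher(words: Sequence[Word], pos_tags) -> List[int]:
    """Toy generic matcher: positions of words whose tag is in pos_tags."""
    return [i for i, (_, pos) in enumerate(words) if pos in pos_tags]


# Two patterns built from the same generic matcher with different parameters.
other = GrammarPattern("not 3rd person singular", tag_matcher, {"pos_tags": {"VBP"}})
third = GrammarPattern("3rd person singular", tag_matcher, {"pos_tags": {"VBZ"}})
toy_present = GrammarFeature("toy present tense", [other, third])

words = [("He", "PRP"), ("writes", "VBZ"), ("and", "CC"), ("they", "PRP"), ("read", "VBP")]
result = toy_present.find(words)
print(result)
```

The point of the design is that a new grammar pattern needs only new parameters, not new matching code.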
The following provide examples of grammar features available within the system of the present embodiment.
Example 1 - The Present Simple Tense
Verbs in the present simple tense can have multiple forms:
1. Active form of a verb without auxiliary not in the 3rd person singular
"I write the document."
2. Active form of a verb without auxiliary in the 3rd person singular
"He writes the document."
3. Active form of a verb with auxiliary not in the 3rd person singular
"Do you write the document?"
4. Active form of a verb with auxiliary in the 3rd person singular
"Does he write the document?"
5. Passive form of a verb not in the 3rd person singular
"The documents are written by him"
6. Passive form of a verb in the 3rd person singular
"The document is written by him."
Forms in the 3rd person singular are differentiated from others, as regular verbs in the present tense take an added final "s".
The use of the present simple tense grammar feature is recognised as sequences of text that match any one of six specific grammar patterns - one for each of the grammatical forms identified above (we will refer to the patterns with the numbers used above). Those six patterns are implemented as instantiations of two generic matchers using different parameters. These generic matchers are:
• The "one entity" generic matcher, which is used to match patterns 1 and 2.
• The "two related entities" generic matcher, which is used to match patterns 3 to 6.
The "one entity" generic matcher
The "one entity" generic matcher is one of the simpler generic matchers in the content retrieval system and is used to implement grammar patterns that can be recognized through the presence of a single word that satisfies a number of constraints. The constraints used to identify words selected by this generic matcher are as follows:
• The part-of-speech tag associated with the word is in a given list of acceptable part-of-speech tags.
• The characters in the word satisfy a set of character constraints. These character constraints include character patterns that the word must match (e.g. ending in "ed") and character strings that the word must not match (e.g. not ending in "ing").
• The word has none of the relations given in a set of disallowed relations with any other word in the document.
In order to instantiate a grammar pattern (Figure 3) based on the "one entity" generic matcher three parameters must be given values:
• A set of acceptable part-of-speech tags (the set of possible part-of-speech tags is defined by the part-of-speech tagger used within the system) must be provided.
• A set of character string pattern constraints (e.g. ends with "ing", does not contain "est" etc) must be given.
• A set of disallowed relations must be provided (possible relations are defined by the dependency parser used in the system).
The "two related entities" generic matcher
The "two related entities" generic matcher returns every pair of words in a document that satisfies the following constraints:
• The part-of-speech tag associated with the first word in the pair is in a list of allowed part-of-speech tags.
• The part-of-speech tag associated with the second word in the pair is in a list of allowed part-of-speech tags.
• The first word in the pair satisfies a set of given character constraints (as described for the "one entity" generic matcher).
• The second word in the pair satisfies a set of given character constraints (as described for the "one entity" generic matcher).
• The first and second words in the pair share one of the relations given in an allowed relation set, but do not share a relation of the same type with any other word.
• The first and second words in the pair do not share any of the relations defined in a disallowed relations set with any word in the document.
To create a grammar pattern (Figure 3) based on the "two related entities" generic matcher, a parameter for each of these constraints must be provided:
• A set of allowed part-of-speech tags for the first word in the pair.
• A set of allowed part-of-speech tags for the second word in the pair.
• A set of character constraints for the first word in the pair.
• A set of character constraints for the second word in the pair.
• The set of relations allowed between the two words in the pair.
• The set of disallowed relations.
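The "one entity" and "two related entities" generic matchers described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Token`, `Relation` and `CharConstraints` types are hypothetical, character constraints are modelled as regular expressions, and the tags and dependency relations of the two sample sentences are written by hand rather than produced by a tagger or parser.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    index: int  # position of the word in the sentence
    text: str
    pos: str    # part-of-speech tag, e.g. "VBP"


@dataclass(frozen=True)
class Relation:
    kind: str       # relation label, e.g. "AUX" or "AUXPASS"
    governor: int   # token index
    dependent: int  # token index


@dataclass(frozen=True)
class CharConstraints:
    must_match: tuple = ()      # regexes the word must match
    must_not_match: tuple = ()  # regexes the word must not match

    def accepts(self, word):
        return (all(re.search(p, word) for p in self.must_match)
                and not any(re.search(p, word) for p in self.must_not_match))


def relations_of(index, relations, kinds):
    """Relations of the given kinds in which the token takes part."""
    return [r for r in relations
            if r.kind in kinds and index in (r.governor, r.dependent)]


def one_entity(tokens, relations, pos_tags, chars, disallowed):
    return [t for t in tokens
            if t.pos in pos_tags
            and chars.accepts(t.text)
            and not relations_of(t.index, relations, disallowed)]


def two_related_entities(tokens, relations, pos1, pos2,
                         chars1, chars2, allowed, disallowed):
    by_index = {t.index: t for t in tokens}
    matches = []
    for r in relations:
        if r.kind not in allowed:
            continue
        # Try both ways of assigning the two linked words to the 1st/2nd slot.
        for a, b in ((r.governor, r.dependent), (r.dependent, r.governor)):
            first, second = by_index[a], by_index[b]
            if first.pos not in pos1 or second.pos not in pos2:
                continue
            if not (chars1.accepts(first.text) and chars2.accepts(second.text)):
                continue
            # Neither word may share a relation of the same type with another word.
            same_kind = {x for i in (a, b) for x in relations_of(i, relations, {r.kind})}
            if same_kind != {r}:
                continue
            if any(relations_of(i, relations, disallowed) for i in (a, b)):
                continue
            matches.append((first, second))
    return matches


NO_CHARS = CharConstraints()

# "I write the document." (hand-written analysis)
simple = [Token(0, "I", "PRP"), Token(1, "write", "VBP"),
          Token(2, "the", "DT"), Token(3, "document", "NN")]
simple_rels = [Relation("NSUBJ", 1, 0), Relation("DET", 3, 2), Relation("DOBJ", 1, 3)]

# "Do you write the document?" (hand-written analysis)
question = [Token(0, "Do", "VBP"), Token(1, "you", "PRP"), Token(2, "write", "VB"),
            Token(3, "the", "DT"), Token(4, "document", "NN")]
question_rels = [Relation("AUX", 2, 0), Relation("NSUBJ", 2, 1),
                 Relation("DET", 4, 3), Relation("DOBJ", 2, 4)]

found_one = one_entity(simple, simple_rels, {"VBP"}, NO_CHARS, {"AUX", "AUXPASS"})
found_two = two_related_entities(question, question_rels, {"VBP"}, {"VB"},
                                 NO_CHARS, NO_CHARS, {"AUX"}, {"AUXPASS"})
```

If the two part-of-speech sets overlap, this sketch may return the same pair in both orders; a fuller implementation would need to decide how to handle that case.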
A grammar pattern representing each of the forms of the simple present tense outlined above is implemented as parameterised instantiations of the "one entity" and "two related entities" generic matchers.
1. Active form of a verb without auxiliary not in the 3rd person singular
This form of the simple present is implemented using the "one entity" generic matcher with the following parameters:
• Acceptable part-of-speech tags: <VBP> (a verb in the present tense, not 3rd person singular)
• Character constraints: None
• Disallowed relations: <AUX> (standard auxiliary) and <AUXPASS> (passive form auxiliary)
Therefore, an active form of a verb without auxiliary in the present simple not in the 3rd person singular is found using a part-of-speech tag (verb in the present tense, not 3rd person singular), and by ensuring that the word found is not used in a compound tense such as the present continuous. In the example in Figure 4, the verb "write" has the correct part-of-speech, and no auxiliary or passive auxiliary relation. It is therefore returned as an active form of a verb without auxiliary in the present simple not in the 3rd person singular. In a second example given in Figure 5, the verb "am" has the correct part-of-speech, but since it shares an auxiliary relation with "writing", it won't be matched as an example of the present simple by this grammar pattern (which is correct since it is an example of the present continuous tense).
2. Active form of a verb without auxiliary in the 3rd person singular
This grammar pattern is very similar to the previous one, the only change being that the allowed set of part-of-speech tags changes to a verb in the third person:
• Acceptable part-of-speech tags: <VBZ> (a verb in the present tense, in the 3rd person singular)
• Character constraints: None
• Disallowed relations: <AUX> (standard auxiliary) and <AUXPASS> (passive form auxiliary)
3. Active form of a verb with auxiliary not in the 3rd person singular
This grammar pattern uses the "two related entities" generic matcher with the following parameters:
• Acceptable part-of-speech tags for 1st word: <VBP> (a verb in the present tense not 3rd person singular)
• Acceptable part-of-speech tags for 2nd word: <VB> (a verb in base form)
• 1st word character constraints: None
• 2nd word character constraints: None
• Allowed relations between 1st and 2nd words: <AUX> (standard auxiliary)
• Disallowed relations: <AUXPASS> (passive form auxiliary)
Therefore, an active form of a verb with auxiliary not in the 3rd person singular is found using two parts of speech (verb in the present tense, not 3rd person singular, and verb in base form), and by ensuring that the two words found share an auxiliary relation and do not share another auxiliary or passive auxiliary relation with any other word.
In the example in Figure 6, the verbs "do" and "write" have the correct parts of speech, and an auxiliary relation is shared between them. They don't have any other auxiliary or passive auxiliary relation, therefore they are returned as an active form of a verb with auxiliary not in the 3rd person singular.
In the example given in Figure 7, the verb "am" has the right part of speech, but the verb "writing" does not, so even though they share an auxiliary relation, they do not satisfy this grammar pattern and so are not returned as an example of the present simple tense (which is correct as this is an example of the present continuous).
4. Active form of a verb with auxiliary in the 3rd person singular
This grammar pattern is very similar to the previous one, the only change is that the first part of speech parameter changes from verb not 3rd person singular to verb 3rd person singular. Thus, the parameters used are:
• Acceptable part-of-speech tags for 1st word: <VBZ> (a verb in the present tense 3rd person singular)
• Acceptable part-of-speech tags for 2nd word: <VB> (a verb in base form)
• 1st word character constraints: None
• 2nd word character constraints: None
• Allowed relations between 1st and 2nd words: <AUX> (standard auxiliary)
• Disallowed relations: <AUXPASS> (passive form auxiliary)
5. Passive form of a verb in the present simple not in the 3rd person singular
This pattern is similar to the previous two, but the second verb is a past participle instead of a base form, and instead of matching standard auxiliary relations and rejecting passive auxiliary ones, we do the opposite. Again the "two related entities" generic matcher is used, but with the following parameters:
• Acceptable part-of-speech tags for 1st word: <VBP> (a verb in the present tense not 3rd person singular)
• Acceptable part-of-speech tags for 2nd word: <VBN> (a verb in past participle form)
• 1st word character constraints: None
• 2nd word character constraints: None
• Allowed relations between 1st and 2nd words: <AUXPASS> (passive form auxiliary)
• Disallowed relations: <AUX> (standard auxiliary)
Therefore, a passive form of a verb not in the 3rd person singular is found using two parts of speech (verb at present tense not 3rd person singular and verb in past participle), and by ensuring the words found share a passive auxiliary relation, and are not involved in any other passive auxiliary or standard auxiliary relation.
6. Passive form of a verb in the present simple in the 3rd person singular
This grammar pattern is very similar to the previous one, since only the first part of speech parameter changes from verb not 3rd person singular to verb 3rd person singular:
• Acceptable part-of-speech tags for 1st word: <VBZ> (a verb in the present tense 3rd person singular)
• Acceptable part-of-speech tags for 2nd word: <VBN> (a verb in past participle form)
• 1st word character constraints: None
• 2nd word character constraints: None
• Allowed relations between 1st and 2nd words: <AUXPASS> (passive form auxiliary)
• Disallowed relations: <AUX> (standard auxiliary)
Thus, the grammar feature function for the present simple tense comprises a combination of the 6 grammar patterns defined above and can be applied to documents to highlight and tag instances of such a grammar feature within the document.
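The six parameter sets listed for the present simple tense can be written down as data and applied together. The following Python sketch assumes a compact, hypothetical `Pattern` record and a single `find` function that covers both the one-word and the two-word case; character constraints are omitted because all six patterns set them to None. The part-of-speech tags and relations of the example sentences are hand-written and trimmed to the relations that matter here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

AUX, AUXPASS = "AUX", "AUXPASS"


@dataclass(frozen=True)
class Pattern:
    number: int
    first_pos: str
    second_pos: Optional[str]  # None means a "one entity" pattern
    allowed: FrozenSet[str]
    disallowed: FrozenSet[str]


PRESENT_SIMPLE = [
    Pattern(1, "VBP", None, frozenset(), frozenset({AUX, AUXPASS})),
    Pattern(2, "VBZ", None, frozenset(), frozenset({AUX, AUXPASS})),
    Pattern(3, "VBP", "VB", frozenset({AUX}), frozenset({AUXPASS})),
    Pattern(4, "VBZ", "VB", frozenset({AUX}), frozenset({AUXPASS})),
    Pattern(5, "VBP", "VBN", frozenset({AUXPASS}), frozenset({AUX})),
    Pattern(6, "VBZ", "VBN", frozenset({AUXPASS}), frozenset({AUX})),
]


def labels_at(i, relations, ignore=None):
    """Labels of every relation touching word i, optionally ignoring one relation."""
    return {r[0] for r in relations if i in r[1:] and r != ignore}


def find(pattern, tags, relations):
    """relations are (label, governor index, dependent index) tuples."""
    hits = []
    if pattern.second_pos is None:
        for i, pos in enumerate(tags):
            if pos == pattern.first_pos and not labels_at(i, relations) & pattern.disallowed:
                hits.append((i,))
        return hits
    for rel in relations:
        label, governor, dependent = rel
        if label not in pattern.allowed:
            continue
        for a, b in ((governor, dependent), (dependent, governor)):
            if (tags[a], tags[b]) != (pattern.first_pos, pattern.second_pos):
                continue
            # No other relation of the same type, and no disallowed relation.
            others = labels_at(a, relations, rel) | labels_at(b, relations, rel)
            if not others & (pattern.disallowed | {label}):
                hits.append((a, b))
    return hits


def present_simple(tags, relations):
    """Map each matching pattern number (1 to 6) to the word positions found."""
    return {p.number: hits for p in PRESENT_SIMPLE if (hits := find(p, tags, relations))}


ACTIVE = [("NSUBJ", 1, 0), ("DET", 3, 2), ("DOBJ", 1, 3)]
QUESTION = [("AUX", 2, 0), ("NSUBJ", 2, 1), ("DET", 4, 3), ("DOBJ", 2, 4)]
PASSIVE = [("DET", 1, 0), ("NSUBJPASS", 3, 1), ("AUXPASS", 3, 2)]
EXAMPLES = {
    "I write the document.": (["PRP", "VBP", "DT", "NN"], ACTIVE),
    "He writes the document.": (["PRP", "VBZ", "DT", "NN"], ACTIVE),
    "Do you write the document?": (["VBP", "PRP", "VB", "DT", "NN"], QUESTION),
    "Does he write the document?": (["VBZ", "PRP", "VB", "DT", "NN"], QUESTION),
    "The documents are written by him.": (["DT", "NNS", "VBP", "VBN", "IN", "PRP"], PASSIVE),
    "The document is written by him.": (["DT", "NN", "VBZ", "VBN", "IN", "PRP"], PASSIVE),
    "I am writing the document.": (["PRP", "VBP", "VBG", "DT", "NN"],
                                   [("NSUBJ", 2, 0), ("AUX", 2, 1), ("DET", 4, 3), ("DOBJ", 2, 4)]),
}
results = {s: present_simple(tags, rels) for s, (tags, rels) in EXAMPLES.items()}
```

Under these hand-written analyses each of the six example sentences triggers exactly one pattern, and the present continuous sentence triggers none.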
Example 2 - Past Continuous
Verbs in the past continuous tense have two forms:
1. Active form
"I was writing the document."
2. Passive form
"The document was being written by him."
The Past Continuous grammar feature uses two grammar patterns corresponding to the grammatical forms identified above. These two patterns are generated from two generic matchers instantiated with different parameters:
• The "two related entities" generic matcher described in the previous example is used to match the active form pattern.
• The "three related entities" generic matcher is used to match the passive form pattern.
The "three related entities" generic matcher
The "three related entities" generic matcher searches for sets of three related words in a document that satisfy the following constraints:
• The part-of-speech tag associated with the 1st word in the set is in a list of allowed part-of-speech tags.
• The part-of-speech tag associated with the 2nd word in the set is in a list of allowed part-of-speech tags.
• The part-of-speech tag associated with the 3rd word in the set is in a list of allowed part-of-speech tags.
• The 1st word in the set satisfies a set of given character constraints (as described for the "one entity" generic matcher).
• The 2nd word in the set satisfies a set of given character constraints (as described for the "one entity" generic matcher).
• The 3rd word in the set satisfies a set of given character constraints (as described for the "one entity" generic matcher).
• The 1st and 3rd words in the set share one of the relations given in a first allowed relation set, but do not share a relation of the same type with any other word.
• The 2nd and 3rd words in the set share one of the relations given in a second allowed relation set, but do not share a relation of the same type with any other word (there is one exception to this rule: if the first and second allowed relation sets are the same then the 3rd word in the set is allowed to share a relation of the same type with the 1st and 2nd words in the set).
Instantiations of the "three related entities" generic matcher, therefore, require eight parameters to define the particulars of each of the constraints listed above. These are as follows:
• Three sets of allowed part-of-speech tags.
• Three sets of character constraints.
• A set of allowed relations between the 1st and 3rd word.
• A set of allowed relations between the 2nd and 3rd word.
Examples of the active form of the past continuous tense are recognised using an instantiation of the previously described "two related entities" generic matcher. Note that this example uses character constraints, as follows:
• Acceptable part-of-speech tags for 1st word: <VBD> (a verb in the past tense)
• Acceptable part-of-speech tags for 2nd word: <VBG> (a verb in present participle form)
• 1st word character constraints: None
• 2nd word character constraints: is not equal to "going"
• Allowed relations between 1st and 2nd words: <AUX> (standard auxiliary)
• Disallowed relations: <AUXPASS> (passive form auxiliary)
Therefore, an active form of the past continuous is found using two parts of speech (verb in the past tense and verb in present participle form), and by ensuring the words found share an auxiliary relation, and that neither of them is involved in any other auxiliary or passive auxiliary relation.
Also, the second word must not be the word "going", since it would potentially be an example of a future event described in the past tense, e.g. "He was going to write the document".
The grammar pattern for the passive form of the past continuous tense is defined as a "three related entities" generic matcher with the parameters:
• Acceptable part-of-speech tags for 1st word: <VBD> (a verb in the past tense)
• Acceptable part-of-speech tags for 2nd word: <VBG> (a verb in present participle form)
• Acceptable part-of-speech tags for 3rd word: <VBN> (a verb in past participle form)
• 1st word character constraints: None
• 2nd word character constraints: None
• 3rd word character constraints: None
• Allowed relations between 1st and 3rd words: <AUX> (standard auxiliary)
• Allowed relations between 2nd and 3rd words: <AUXPASS> (passive form auxiliary)
A passive form of past continuous is found using three parts of speech (verb at past tense, verb at present participle and verb at past participle), and by ensuring the first and third words found share an auxiliary relation, and the second and third words found share a passive auxiliary relation. Also, none of the three words are involved in any other auxiliary or passive auxiliary relation.
Again, the grammar feature function for the past continuous comprises a combination of the 2 grammar patterns defined above and can be applied to documents to highlight and tag instances of such a grammar feature within the document.
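A possible Python sketch of the "three related entities" generic matcher, instantiated with the passive past continuous parameters given above, is shown below. The function and type names are hypothetical, the sentence analyses are written by hand, and the handling of the "same relation type" exception is one interpretation of the rule: the 3rd word may hold two relations of one type only when both of its links are of that type.

```python
from typing import Tuple

Word = Tuple[str, str]           # (text, part-of-speech tag)
Relation = Tuple[str, int, int]  # (label, governor index, dependent index)


def any_word(text):
    """Character constraint that accepts every word ("None" in the text)."""
    return True


def joining(relations, labels, a, b):
    """Relations with a label in `labels` that join words a and b."""
    return [r for r in relations if r[0] in labels and {r[1], r[2]} == {a, b}]


def count(relations, label, i):
    """Number of relations with this label in which word i takes part."""
    return sum(1 for r in relations if r[0] == label and i in r[1:])


def three_related_entities(words, relations, pos, chars, allowed_13, allowed_23):
    def candidates(slot):
        return [i for i, (text, tag) in enumerate(words)
                if tag in pos[slot] and chars[slot](text)]

    hits = []
    for third in candidates(2):
        for first in candidates(0):
            for second in candidates(1):
                if len({first, second, third}) < 3:
                    continue
                for r13 in joining(relations, allowed_13, first, third):
                    for r23 in joining(relations, allowed_23, second, third):
                        # The 3rd word carries both links, so it may hold two
                        # relations of one type only when both links share it.
                        limit = 2 if r13[0] == r23[0] else 1
                        if (count(relations, r13[0], first) == 1
                                and count(relations, r23[0], second) == 1
                                and count(relations, r13[0], third) == limit
                                and count(relations, r23[0], third) == limit):
                            hits.append((first, second, third))
    return hits


def passive_past_continuous(words, relations):
    return three_related_entities(
        words, relations,
        pos=({"VBD"}, {"VBG"}, {"VBN"}),
        chars=(any_word, any_word, any_word),
        allowed_13={"AUX"}, allowed_23={"AUXPASS"})


# "The document was being written by him." (hand-written analysis)
passive = [("The", "DT"), ("document", "NN"), ("was", "VBD"), ("being", "VBG"),
           ("written", "VBN"), ("by", "IN"), ("him", "PRP")]
passive_rels = [("DET", 1, 0), ("NSUBJPASS", 4, 1), ("AUX", 4, 2), ("AUXPASS", 4, 3)]

# "I was writing the document." (active form, should not match)
active = [("I", "PRP"), ("was", "VBD"), ("writing", "VBG"), ("the", "DT"), ("document", "NN")]
active_rels = [("NSUBJ", 2, 0), ("AUX", 2, 1), ("DET", 4, 3), ("DOBJ", 2, 4)]

passive_hits = passive_past_continuous(passive, passive_rels)
active_hits = passive_past_continuous(active, active_rels)
```

The sketch checks exclusivity only for relations of the matched types, as in the matcher definition; it does not add a separate disallowed relation set.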
Example 3 - Non-reflexive Personal Pronouns
The non-reflexive personal pronouns are I, we, you, he, she, it, and they. Finding the occurrence of just these words, however, is not enough to determine their use as non-reflexive personal pronouns.
In order to determine that the occurrence of one of these words is being used as a non-reflexive personal pronoun, the PRP part-of-speech tag must also be used. So, the non-reflexive personal pronouns can be implemented simply using the "one entity" generic matcher with the following parameters:
• Acceptable part-of-speech tags: <PRP> (personal pronoun)
• Character constraints: contains {"I", "we", "you", "he", "she", "it", "they"}
• Disallowed relations: None
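As a small illustration, the pronoun pattern might be sketched in Python as below. Two assumptions are made: the "contains" constraint is read as "the word is one of the listed words" (compared case-insensitively), since a substring test would wrongly accept words such as "herself", and the tagger is assumed to give reflexive and object pronouns the same PRP tag, as in the hand-tagged sample sentence.

```python
# Words accepted by the character constraint (lower-cased for comparison).
NON_REFLEXIVE = {"i", "we", "you", "he", "she", "it", "they"}


def non_reflexive_personal_pronouns(words):
    """words is a list of (text, part-of-speech tag); returns matching positions."""
    return [i for i, (text, tag) in enumerate(words)
            if tag == "PRP" and text.lower() in NON_REFLEXIVE]


# "She told him herself." with hand-assigned tags.
sentence = [("She", "PRP"), ("told", "VBD"), ("him", "PRP"), ("herself", "PRP"), (".", ".")]
found = [sentence[i][0] for i in non_reflexive_personal_pronouns(sentence)]
print(found)
```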
Once the database 20 has been populated a user interface application 22 can be provided to allow a user to search for suitable authentic texts within the database. The application 22 can be a stand-alone application or a web based application or indeed both types of application can be used to access the database 20.
As with conventional text search engines, the user interface, Figure 8, can comprise a search frame and a results frame. The search frame enables keywords to be entered in a text entry box 80 to limit or rank documents according to the number of instances of those words contained within the documents. A list box 82 can be provided to enable a user to select document complexity from, for example, A1 to C2. A meta-tag box 84 can be used to allow users to select documents to which meta-tags, e.g. Sport, have been applied by other users of the system, so dividing documents within the database by subject. One or more date entry fields 86 can be provided to enable users to select documents acquired or published within or around a given time period. However, distinct from other search engines, the user interface application 22 of the present invention enables users to select one or more grammar features 88, for example, past continuous, present simple. When a user selects such a feature, the application 22 can now select documents that include such features and satisfy any other criteria specified through the boxes 80-86. Clearly, many documents, even of the same level of complexity or within a given subject and having given keywords, may include many instances of a given grammar feature. In such cases, the application 22 can rank selected documents D1...Dn according to the predominance of the feature within the document and display these accordingly within the results frame (with the option to go to the next results page if n>x).
So for example, a user searching for Sports articles at level B1 mentioning Ireland and France and using the past continuous, in the simplest case would be provided with a ranked list of documents in which the document with the densest use of the grammar feature would be ranked highest.
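The search in this example could be sketched as follows in Python. The `IndexedDocument` record, its field names and the sample documents are hypothetical, and "densest use" is assumed here to mean occurrences of the feature divided by the word count of the document; the text does not fix a particular formula.

```python
from dataclasses import dataclass


@dataclass
class IndexedDocument:
    title: str
    level: str            # complexity level, e.g. "B1"
    subject: str          # meta-tag applied by users, e.g. "Sport"
    word_count: int
    keywords: set         # lower-cased keywords found in the document
    feature_counts: dict  # grammar feature name -> number of occurrences


def search(documents, feature, level=None, subject=None, keywords=()):
    """Filter by the conventional criteria, then rank by feature density."""
    hits = [d for d in documents
            if d.feature_counts.get(feature, 0) > 0
            and (level is None or d.level == level)
            and (subject is None or d.subject == subject)
            and all(k.lower() in d.keywords for k in keywords)]
    return sorted(hits,
                  key=lambda d: d.feature_counts[feature] / d.word_count,
                  reverse=True)


# Invented sample index entries, for illustration only.
docs = [
    IndexedDocument("Match report", "B1", "Sport", 400,
                    {"ireland", "france"}, {"past continuous": 8}),
    IndexedDocument("Preview", "B1", "Sport", 1000,
                    {"ireland", "france"}, {"past continuous": 10}),
    IndexedDocument("Budget news", "B1", "Politics", 500,
                    {"ireland"}, {"past continuous": 20}),
    IndexedDocument("Interview", "C1", "Sport", 300,
                    {"ireland", "france"}, {"past continuous": 9}),
]
ranked = search(docs, "past continuous", level="B1", subject="Sport",
                keywords=["Ireland", "France"])
print([d.title for d in ranked])
```

Ranking by density rather than by raw count keeps a long document with a few scattered occurrences from outranking a short one that uses the feature heavily.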
In its simplest form, a user simply selects a given document and is vectored to either the original web page from which the document was sourced or a locally cached version of the document - according to the applicable licensing arrangements. The user interface application 22 can of course be extended to apply any number of different post processing operations to documents provided from the database 20. For example, where a document was originally sourced from a web page, assuming the appropriate licensing arrangements are in place with the web page owner, the text of the document can be extracted from the various other information in the web page and instead the text displayed for the user can be enhanced, for example, to include a glossary of terms translating keywords of the document into a target language. Alternatively, the passages of text corresponding to the grammar features of the query can be highlighted. Again, alternatively, the document can be abstracted to condense the source document into passages in which a grammar feature is used.
In addition or alternatively, the application 22 can be arranged either to provide a print version of the document including for example exercise questions of the type outlined in the introduction above; or the application can include interactive exercises, where for example, a user swipes with a pointer to identify text corresponding to a given grammar feature i.e. the application 22 would retrieve suitable documents, but the user rather than the application would highlight the language.
As the system in the illustrated embodiment can continually acquire and process documents through the modules 14 and 18 both from the internet or any continually updated sources, the system is able to provide the most relevant authentic texts for teachers and students so maximizing their time spent teaching and learning a language.

Claims

1. A content retrieval system, said system including a grammatical analyzer cooperable with a set of grammar feature functions, at least one of said functions defining a pattern comprising: a required part-of-speech for at least one word of a sentence within a document, a set of character constraints for the word and a dependency relationship between said word and another word in said sentence within a document, said grammatical analyzer being arranged to apply at least one grammar feature function to a set of documents to identify documents including text conforming to said grammar feature function definition, said system further including a user interface component arranged to enable a user to specify at least one of said set of grammar feature functions and to provide to said user a filtered set of said documents according to the documents including text conforming to said grammar feature function.
2. The system of claim 1 wherein at least one of said functions comprises a plurality of patterns, at least one of said patterns comprising a required part-of-speech for at least one word of a sentence within a document, a set of character constraints for the word and a dependency relationship between said word and another word in said sentence within a document.
3. The system of claim 2 wherein said dependency relationship comprises one or both of a permitted relationship or non-permitted relationship.
4. The system of claim 2, wherein said character constraints comprise one or both of a permitted set of characters or a non-permitted set of characters.
5. The system of claim 1 wherein at least one of said functions comprises a pattern comprising a required part-of-speech for at least two words of a sentence within a document, a set of character constraints for the at least two words and a dependency relationship between said two words.
6. The system of claim 1 further comprising a tokenizer arranged to identify individual words within a document, a part of speech analyzer arranged to provide a part-of-speech tag to each word of said document and a dependency analyzer arranged to generate grammatical relationships between words of said document.
7. The system of claim 6 wherein said analyzers are arranged to output an analyzed document in XML format including said part-of-speech tags and said grammatical relationships.
8. The system of claim 1 wherein said system is arranged to obtain documents for analysis from the Internet and said grammatical analyzer is arranged to analyse and tag obtained documents for each defined grammar feature function and to store said analysis information in association with said document.
9. The system of claim 8 wherein said user interface component is arranged to use said stored analysis information in providing said filtered set of documents.
10. The system of claim 1 wherein said user interface component is arranged to display said filtered set ranked according to their relevance to said user specified grammar feature.
11. The system of claim 1 wherein said user interface component is arranged to display a selected document for said filtered set in an enhanced manner.
12. The system of claim 11 wherein said enhancement comprises one or more of: highlighting text within said document conforming to said user specified grammar feature; abstracting said document prior to display; displaying a glossary for said document; or displaying a set of exercises for said document.
13. The system of claim 1 wherein said grammar feature functions define patterns for one or more of: an active form of a verb; a passive form of a verb; or a non-reflexive pronoun.
PCT/EP2011/061691 2010-10-07 2011-07-08 Content retrieval system WO2012045492A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IES2010/0646 2010-10-07
IE20100646 2010-10-07

Publications (1)

Publication Number Publication Date
WO2012045492A1 (en) 2012-04-12

Family

ID=44628828

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2011/061691 WO2012045492A1 (en) 2010-10-07 2011-07-08 Content retrieval system

Country Status (1)

Country Link
WO (1) WO2012045492A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6842730B1 (en) * 2000-06-22 2005-01-11 Hapax Limited Method and system for information extraction
US20060190242A1 (en) * 2005-02-22 2006-08-24 Educational Testing Service Method and system for automated item development for language learners

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
DAN KLEIN, CHRISTOPHER D. MANNING: "Accurate Unlexicalized Parsing", PROCEEDINGS OF THE 41ST MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2003, pages 423 - 430, XP055061192
DAN KLEIN, CHRISTOPHER D. MANNING: "Advances in Neural Information Processing Systems", vol. 15, 2002, MIT PRESS, article "Fast Exact Inference with a Factored Model for Natural Language Parsing", pages: 3 - 10
JENNY ROSE FINKEL, TROND GRENAGER, CHRISTOPHER MANNING: "Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling", PROCEEDINGS OF THE 43ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2005, pages 363 - 370
K B HAASE: "Analogy in the Large", IJCAI'95 PROCEEDINGS OF THE 14TH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, vol. 2, 1995, San Francisco, CA, US, XP002661075, ISBN: 1-55860-363-8, Retrieved from the Internet <URL:http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.139.1366&rep=rep1&type=pdf> [retrieved on 20111010] *
KRISTINA TOUTANOVA, CHRISTOPHER D. MANNING: "Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger", PROCEEDINGS OF THE JOINT SIGDAT CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND VERY LARGE CORPORA, 2000, pages 63 - 70
KRISTINA TOUTANOVA, DAN KLEIN, CHRISTOPHER MANNING, YORAM SINGER: "Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network", PROCEEDINGS OF HLT-NAACL, 2003, pages 252 - 259
S HOFFMANN, S EVERT: "BNCweb (CQP-edition): The marriage of two corpus tools", 2006, Lancaster, GB, XP002661074, Retrieved from the Internet <URL:http://corpora.lancs.ac.uk/BNCweb/Hoffmann-Evert.pdf> [retrieved on 20111010] *
STEFAN EVERT ET AL: "The IMS Open Corpus Workbench (CWB) CQP Query Language Tutorial", 17 February 2010 (2010-02-17), XP002661076, Retrieved from the Internet <URL:http://cwb.sourceforge.net/files/CQP_Tutorial.pdf> [retrieved on 20111010] *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11735411

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11735411

Country of ref document: EP

Kind code of ref document: A1