WO2015003143A2 - Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus - Google Patents

Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus

Info

Publication number
WO2015003143A2
Authority
WO
WIPO (PCT)
Prior art keywords
computer
text
corpus
relation
discourse
Prior art date
Application number
PCT/US2014/045432
Other languages
French (fr)
Other versions
WO2015003143A3 (en)
Inventor
Blake HOWALD
Andrew NYSTROM
Original Assignee
Thomson Reuters Global Resources
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Thomson Reuters Global Resources
Priority to AU2014285073A (patent AU2014285073B9)
Priority to CA2917153A (patent CA2917153C)
Publication of WO2015003143A2
Publication of WO2015003143A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures

Definitions

  • the present invention relates generally to human language/natural language processing (NLP), information retrieval and more particularly to predicting implicit rhetorical relations between spans of text within documents. Also, the invention relates to processes, software and systems for use in delivery of services related to the legal, corporate, accounting, research, educational, and other professional sectors. The invention relates to a system that presents searching functions to users, such as subscribers to a professional services related service, processes search terms and applies search syntax across document databases, and displays search results generated in response to the search function and processing.
  • NLP human language/natural language processing
  • Search engines are used to retrieve documents in response to user defined queries or search terms. To this end, search engines may compare the frequency of terms that appear in one document against the frequency of those terms as they appear in other documents within a database or network of databases. This aids the search engine in determining respective "importance" of the different terms within the document, and thus determining the best matching documents to the given query.
  • One method for comparing terms appearing in a document against a collection of documents is called Term Frequency-Inverse Document Frequency (TFIDF or TF-IDF).
  • TFIDF assigns a weight as a statistical measure used to evaluate the importance of a word to a document in a collection of documents or corpus.
  • The relative "importance" of the word increases proportionally to the number of times, or "frequency", such word appears in the document. The importance is offset or compared against the frequency of that word appearing in documents comprising the corpus.
  • The inverse document frequency component of TFIDF is expressed as log(N/n(q)), where q is the query term, N is the number of documents in the collection and n(q) is the number of documents containing q.
  • TFIDF and variations of this weighting scheme are typically used by search engines, such as Google, as a way to score and rank a document's relevance given a user query.
  • the document may be ranked in relevance based on summing the scores associated with each term.
  • the documents responsive to the user query may be ranked and presented to the user based on relevancy as well as other determining factors.
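As a rough illustration of the weighting and ranking described above, the following Python sketch scores each document by summing term frequency times log(N/n(q)) over the query terms. The whitespace tokenizer, the function names and the three toy documents are assumptions made for this illustration; production search engines use more elaborate variants.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def idf(term, docs_tokens):
    """Inverse document frequency log(N / n(q)); 0.0 if no document has the term."""
    n_q = sum(1 for tokens in docs_tokens if term in tokens)
    return math.log(len(docs_tokens) / n_q) if n_q else 0.0


def rank(query, documents):
    """Return (score, doc_index) pairs sorted by summed TF-IDF, best first."""
    docs_tokens = [tokenize(d) for d in documents]
    scores = []
    for i, tokens in enumerate(docs_tokens):
        tf = Counter(tokens)
        score = sum(tf[t] * idf(t, docs_tokens) for t in tokenize(query))
        scores.append((score, i))
    # Sort by descending score; ties fall back to document order.
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Toy documents invented for this sketch.
documents = [
    "the court granted the motion",
    "the market rose on strong earnings",
    "investors sold as the market fell and the market closed lower",
]
ranking = rank("market earnings", documents)
```

A term that occurs in every document, such as "the" here, receives a weight of log(1) = 0 and so does not affect the ranking.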
  • SVM support vector machine
  • model primary or dual formulation
  • the model includes one or more inputs and one or more outputs, as well as one or more gains, each a respective partial derivative of an output with respect to a respective input.
  • the trained model may be used to control or manage the plant or process.
  • Treebank is a large scale corpus annotated with information related to discourse structure and discourse semantics. While there are many aspects of discourse that are crucial to a complete understanding of natural language, the PDTB focuses on encoding discourse relations.
  • the annotation methodology follows a lexically-grounded approach.
  • the PDTB has strived to maintain a theory-neutral approach with respect to the nature of high-level representation of discourse structure, in order to allow the corpus to be usable within different theoretical frameworks.
  • Theory-neutrality is achieved by keeping annotations of discourse relations "low-level": each discourse relation is annotated independently of other relations, that is, dependencies across relations are not marked.
  • the PDTB is a project aimed at supporting the extraction of a range of inferences associated with discourse relations, for a wide range of NLP applications, such as parsing, information extraction, question-answering, summarization, machine translation, generation, as well as corpus based studies in linguistics and psycholinguistics.
  • the PDTB project also aims to conduct empirical research with the PDTB corpus, for NLP as well as theoretical linguistics.
  • Discourse relations in the current version of the PDTB are taken to be triggered by explicit phrases or by structural adjacency.
  • Each relation is further annotated for its two abstract object arguments, the sense of the relation, and the attributions associated with the relation and each of its two arguments.
  • the annotations in the PDTB are aligned with the syntactic constituency annotations of the Penn Treebank.
  • Example (1) a. Pascale finished Fox in Sox.
  • the NARRATION (or TEMPORAL.SYNCHRONOUS.SUCCESSION in the PDTB) relation holds between the actions in (1a-b), as (1b) follows (1a) at event time.
  • the EXPANSION relation, providing more information about Pascale and The Cat in the Hat, holds between (1b-c).
  • (1c) is temporally inclusive (subordinated) with (1b); there is no temporal progression at event time.
  • the CONTRAST relation (1c-d) is temporally inclusive as well and sets an expectation for a RESULT relation which holds between (1d-e), temporally following the event progression in (1a-b).
  • Example (2) a. Pascale finished Fox in Sox.
  • RST contains VOLITIONAL and NON-VOLITIONAL CAUSE relations whereas SDRT only has CAUSE.
  • Previous machine learning tasks related to these theories report a wide range of prediction accuracies for all target rhetorical relations combined: 49.70% (6-way classifier) (Daniel Marcu and Abdessamad Echihabi. 2002. An Unsupervised Approach to Recognizing Discourse Relations.
  • each PDTB annotation, which holds between two spans of text (Arg1, Arg2), indicates whether the relation is Explicit (3a) or Implicit (3c), and gives the actual discourse marker if the relation is explicit; if it is implicit, the PDTB annotation provides an adjudicated marker that captures the relation, e.g., because in (3c).
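One way to picture such an annotation is as a small record holding the two spans, the explicit/implicit flag, the marker and a dotted sense label. The sketch below is a hypothetical schema for illustration only: the class name, the field names and the toy sentences are assumptions and do not reflect the PDTB file format.

```python
from dataclasses import dataclass
from typing import Optional

LEVELS = ("class", "type", "subtype")


@dataclass
class RelationRecord:
    """One discourse relation between two text spans (hypothetical schema)."""
    arg1: str
    arg2: str
    explicit: bool          # True if a discourse marker appears in the text
    marker: Optional[str]   # actual marker if explicit, adjudicated marker if implicit
    sense: str              # dotted sense label, e.g. "COMPARISON.CONCESSION"

    def sense_levels(self):
        """Split the dotted sense into class/type/subtype, padding with None."""
        parts = self.sense.split(".")
        padded = parts + [None] * (len(LEVELS) - len(parts))
        return dict(zip(LEVELS, padded))


# Invented toy example, not taken from the PDTB.
record = RelationRecord(
    arg1="The flight was cancelled.",
    arg2="A storm closed the airport.",
    explicit=False,
    marker="because",
    sense="CONTINGENCY.CAUSE",
)
levels = record.sense_levels()
```

Keeping the sense as a dotted string and splitting it on demand lets the same record serve classifiers that operate at the class level or at a finer level.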
  • Dialogue 117-125 - Wellner et al. (2006)); (8) The presence or absence of a modal verb, specific modal verbs and their cross-product; (9) Whether or not the implicit relation immediately follows or precedes an explicit relation (following Pitler et al. (2008)); and (10) Different variations of word pair models trained on the TextRels, PDTB implicit and explicit training sets - for example, word pairs contributing to the highest information gain for a given relation: the-but, of-but, to-but strongly associate with COMPARISON, whereas the-and, a-and strongly associate with CONTINGENCY.
  • the present invention provides a method and system for simplifying rhetorical relation prediction in large scale annotated corpus or database. More particularly, even if discourse markers are missing, the invention can favorably achieve effective performance for rhetorical relation prediction. In one manner, the rhetorical structure (progression of relations) between Examples (1) and (2) above is arguably similar and open to wider interpretation, but recoverable.
  • Although the invention is described in connection with the PDTB, as it provides a wealth of robustly annotated Wall Street Journal ("WSJ") data and has been the locus of comparative research in this area, the invention is not limited to PDTB.
  • the invention advances a line of research focused on predicting implicit rhetorical relations between two spans of text, for example in the Penn Discourse Treebank ("PDTB").
  • Rhetorical relations are a pragmatic feature of texts that are cued very strongly by an explicit discourse marker (e.g., but, when).
  • determining a rhetorical relation in the absence of an explicit discourse marker has proven to be quite difficult.
  • State of the art prediction relies on a myriad of surface level features designed to capture the pragmatic information encoded in the absent marker.
  • overall performance only achieves a macro-F1 between 36 and 40% for all relations combined.
  • the invention has demonstrated that using a simplified feature set based only on raw text and semantic dependencies meets or exceeds previous performance by up to 5% for all relations and up to 14% for certain individual relations.
  • Using surface level features to predict implicit rhetorical relations for the PDTB approaches a theoretical maximum performance, suggesting that more data will not necessarily improve performance based on these and similarly situated features.
  • the invention provides a computer-implemented method for predicting implicit rhetorical relation between spans of text in the absence of an explicit discourse marker, the method represented as instructions stored in memory for recall and processing by a processor such that when executed the method provides a feature vector model comprising a representation of simplified feature set based on raw text and semantic dependencies implemented with a machine learning process, wherein the model comprises one or more inputs and one or more outputs.
  • the method having: identifying by use of a processor executing a set of code a first factor associated with a first relation and associated with a first span of text Arg1 and a second factor associated with a second relation and associated with a second span of text Arg2; and processing one or more of the following features: (1) sequence expressing the first and second relations as a normalized percentage; (2) text unigrams, bigrams and/or trigrams of Arg1 and Arg2; (3) unigram, bigram and trigram dependencies of Arg1 and Arg2; and (4) the occurrence of one or more of a date, time, location, person, money, percent, organization named entity.
  • the first embodiment may be further characterized in having one or more of the following additional features: the sequence of the first relation in a four relation discourse is approximately 0.250; the first and second spans of text Arg1 and Arg2 are part of an annotated corpus; the annotated corpus is one of the group consisting of the Penn Discourse Treebank ("PDTB"); Rhetorical Structure Theory corpus; and the Discourse Graph Bank; the annotated corpus is used to train a system to determine classifications; measuring performance relative to the annotated corpus to determine classifier acceptance; applying an accepted classifier to an un-annotated corpus; the first and second spans of text Arg1 and Arg2 are classified with a rhetorical label stored within the annotated corpus;
  • PDTB Penn Discourse Treebank
  • surface level features are used to capture pragmatic information encoded in the absent discourse marker; the one or more features comprises a simplified feature set based only on one or both of raw text and semantic dependencies; the rhetorical relation is represented in a hierarchy comprising one or more levels including one or more of class level, type level and subtype level; each level comprises a set of senses; the one or more levels includes a class level comprising the following set of senses: temporal, contingency, comparison and expansion; and the one or more levels includes a type level comprising a set of senses different from the class level set of senses.
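The four feature groups listed above (normalized sequence position, text n-grams, dependency n-grams and named-entity occurrence) can be sketched as a sparse feature dictionary. This is a minimal illustration, not the patented implementation: the function names, the feature-key format and the whitespace tokenization are assumptions, and the dependency labels are supplied by hand here rather than produced by a parser.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sequence_position(index, total):
    """1-based position of a relation in its discourse, as a fraction of the total."""
    return index / total


def extract_features(arg1_text, arg2_text, arg1_deps, arg2_deps, entities, index, total):
    """Build a sparse feature dict from two spans (key format is hypothetical)."""
    features = {"SEQ": sequence_position(index, total)}
    spans = (
        ("TEXT_ARG1", arg1_text.lower().split()),
        ("TEXT_ARG2", arg2_text.lower().split()),
        ("DEP_ARG1", arg1_deps),
        ("DEP_ARG2", arg2_deps),
    )
    for prefix, tokens in spans:
        for n in (1, 2, 3):  # unigrams, bigrams, trigrams
            for gram in ngrams(tokens, n):
                key = f"{prefix}={gram}"
                features[key] = features.get(key, 0) + 1
    for label in entities:  # e.g. "ORGANIZATION", "DATE"
        features[f"NER={label}"] = 1
    return features


features = extract_features(
    arg1_text="Anyway ZBB's procedures were so cumbersome",
    arg2_text="A strategic review is fundamentally different",
    arg1_deps=["ADVMOD", "POSS", "NSUBJ", "COP", "ADVMOD", "ROOT"],
    arg2_deps=["DET", "AMOD", "NSUBJ", "COP", "ADVMOD", "ROOT"],
    entities=[],          # no named entities in either span
    index=1, total=4,     # first relation in a four-relation discourse
)
```

With `index=1` and `total=4` the sequence feature comes out as 0.25, matching the "approximately 0.250" value given for the first relation in a four relation discourse.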
  • the invention provides a computer-based system for predicting implicit rhetorical relation between spans of text in the absence of an explicit discourse marker, the system comprising a processor, a memory, a user interface and a display.
  • the system further having: a set of instructions stored in the memory and when executed by the processor adapted to provide a feature vector model comprising a representation of simplified feature set based on raw text and semantic dependencies implemented with a machine learning process, wherein the model comprises one or more inputs and one or more outputs; identifying by use of a processor executing a set of code a first factor associated with a first relation and associated with a first span of text Arg1 and a second factor associated with a second relation and associated with a second span of text Arg2; a rhetorical relation module comprising a set of code when executed by the processor adapted to process one or more of the following features: (1) sequence expressing the first and second relations as a normalized percentage; (2) text unigram, bigram and/or trigram
  • Figure 1 is a block diagram illustrating one embodiment of the Rhetorical Relation Analyzer/Predictor implemented in a document retrieval system architecture according to the present invention.
  • Figure 2 is a block diagram further illustrating a system architecture for implementing the embodiment of Figure 1.
  • Figure 3 is a graphical representation of actual points plotted in a macro-F1 score vs. training instance count graph in connection with the present invention.
  • While much is described in the context of PDTB as the exemplary corpus, the invention is not limited to PDTB and may be used with beneficial effect generally with annotated corpora.
  • other annotated corpora include the Rhetorical Structure Theory corpus and the Discourse Graph Bank. These are both academic corpora similar to the PDTB.
  • the annotated corpus is used to train a system to distinguish good from bad classifications.
  • performance is measured relative to the annotated corpus, i.e., how many instances the subject classifier got right and how many it got wrong.
  • Multiple annotated corpora may be used to arrive at the desired features and classifications.
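Performance against an annotated corpus is reported in this document as macro-F1, i.e., the unweighted mean of the per-relation F1 scores. The sketch below shows one conventional way to compute it; the function name and the six toy gold/predicted labels are invented for illustration and are not results from the patent.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 over all labels seen in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels invented for this sketch.
gold = ["EXPANSION", "EXPANSION", "CONTINGENCY", "COMPARISON", "TEMPORAL", "EXPANSION"]
pred = ["EXPANSION", "CONTINGENCY", "CONTINGENCY", "EXPANSION", "TEMPORAL", "EXPANSION"]
score = macro_f1(gold, pred)
```

Because every relation counts equally, a classifier that never predicts a rare relation such as COMPARISON is penalized even if its overall accuracy is high.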
  • the inventive method may be applied to an un-annotated corpus, such as commercial and proprietary corpora, e.g., the Thomson Reuters News Archive.
  • Arg1 and Arg2 correspond simply to two spans of text. The spans are considered "arguments" and can be sentences or phrases.
  • the PDTB calls them Arg1 and Arg2, but more generally, for other annotated corpora and un-annotated corpora, the method will identify two spans of text and attempt to classify them with the appropriate rhetorical label.
  • the ability to predict implicit relations (39% of the annotated relations) has proven to be quite difficult compared to their explicit counterparts.
  • system 100 provides a framework for searching, retrieving, analyzing, and ranking claims and/or documents.
  • System 100 may be used in conjunction with a system offering of a professional services provider, e.g., West Services Inc., a part of Thomson Reuters Corporation, and in this example includes a Central Network Server/Database Facility 101 comprising a Network Server 102, a Proprietary Corpora Database, e.g., Thomson Reuters News Archive, 103, a Document Retrieval System 104 having as components a Rhetorical Relations Analyzer (RRA) 105, a Feature Extraction module 106, a Machine Learning Module (e.g., SVM), 107 and a Machine Learning Algorithm
  • RRA Rhetorical Relations Analyzer
  • Feature Extraction Module 106 creates features relevant for classification.
  • Machine Learning Module 107 includes algorithms and processes for performing any of one or more machine learning approaches/techniques. Although the exemplary embodiments described herein often refer to support vector machine "SVM" the invention is not limited to this approach. For example, and not by way of limitation, in addition to SVM the Machine Learning Module 107 may use or include Naive Bayes and Decision Tree classification algorithms as are well known in the art. Machine Learning Testing/Training Data Module 108 allows the user to test the performance of multiple machine learning
  • the invention creates features that could, in theory, be used with any machine learning algorithm.
  • the invention may be used as follows: (1) create features relevant for classification; (2) test multiple machine learning algorithms against training data, e.g., against known annotated corpus such as PDTB; (3) measure and record performance of the tested machine learning algorithms; (4) select the preferred machine learning algorithm; and (5) apply the selected preferred machine learning algorithm to a proprietary corpus, e.g., Thomson Reuters News Archive.
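Steps (2) through (5) above amount to a small model-selection loop. The sketch below illustrates that loop with two deliberately simple stand-in classifiers written in plain Python; it is not the SVM of the exemplary embodiment. All function names and the toy training data are assumptions, and accuracy is used as the measure only to keep the example short.

```python
import math
from collections import Counter, defaultdict


def train_majority(examples):
    """Baseline: always predict the most frequent training label."""
    top = Counter(label for _, label in examples).most_common(1)[0][0]
    return lambda features: top


def train_naive_bayes(examples):
    """Naive Bayes over feature sets with add-one smoothing."""
    label_counts = Counter(label for _, label in examples)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for features, label in examples:
        feat_counts[label].update(features)
        vocab.update(features)

    def predict(features):
        def log_prob(label):
            total = sum(feat_counts[label].values()) + len(vocab)
            score = math.log(label_counts[label] / len(examples))
            for f in features:
                if f in vocab:  # ignore features never seen in training
                    score += math.log((feat_counts[label][f] + 1) / total)
            return score
        return max(sorted(label_counts), key=log_prob)

    return predict


def accuracy(predict, examples):
    """Fraction of held-out examples labeled correctly."""
    return sum(predict(f) == y for f, y in examples) / len(examples)


def select_best(trainers, train_set, held_out):
    """Train each candidate, record its held-out score, return the best name."""
    results = {name: accuracy(train(train_set), held_out)
               for name, train in trainers.items()}
    return max(sorted(results), key=results.get), results


# Toy annotated data invented for this sketch: (feature set, relation label).
train_set = [
    ({"market", "investors"}, "CONTINGENCY"), ({"market", "fell"}, "CONTINGENCY"),
    ({"mr", "said"}, "TEMPORAL"), ({"mr", "then"}, "TEMPORAL"),
    ({"rose", "shares"}, "EXPANSION"), ({"rose", "prices"}, "EXPANSION"),
    ({"rose", "also"}, "EXPANSION"),
]
held_out = [({"market"}, "CONTINGENCY"), ({"mr"}, "TEMPORAL"), ({"rose"}, "EXPANSION")]
trainers = {"majority": train_majority, "naive_bayes": train_naive_bayes}

best, results = select_best(trainers, train_set, held_out)
# Step (5): apply the selected classifier to un-annotated input.
label = trainers[best](train_set)({"market", "closed"})
```

Each candidate is passed in as a training function that returns a prediction function, so any other algorithm can be added to `trainers` without changing the selection loop.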
  • the remote user system 109 in this example includes a GUI interface operated via a computer 110, such as a PC computer or the like, that may comprise a typical combination of hardware and software including, as shown in respect to computer 110, system memory 112, operating system 114, application programs 116, graphical user interface (GUI) 118, processor 120, and storage 122 which may contain electronic information 124 such as electronic documents.
  • GUI graphical user interface
  • the methods and systems of the present invention, described in detail hereafter, may be employed in providing remote users access to a searchable database.
  • remote users may search a patent document database using search queries based on patent claims to retrieve and view patent documents of interest.
  • the invention provides scoring and ranking processes that facilitate an efficient and highly effective, and much improved, searching and retrieving operation.
  • Client side application software may be stored on a machine-readable medium and comprise instructions executed, for example, by the processor 120 of computer 110, and presentation of web-based interface screens facilitates the interaction between user system 109 and central system 101.
  • the operating system 114 should be suitable for use with the system 101 and browser functionality described herein, for example, Microsoft Windows Vista (business, enterprise and ultimate editions), Windows 7, or Windows XP Professional with appropriate service packs.
  • the system may require the remote user or client machines to be compatible with minimum threshold levels of processing capabilities, e.g., Intel Pentium III, speed, e.g., 500 MHz, minimal memory levels and other parameters.
  • Central system 101 may include a network of servers, computers and databases, such as over a LAN, WLAN, Ethernet, token ring, FDDI ring or other communications network infrastructure. Any of several suitable communication links are available, such as one or a combination of wireless, LAN, WLAN, ISDN, X.25, DSL, and ATM type networks, for example.
  • Software to perform functions associated with system 101 may include self-contained applications within a desktop or server or network environment and may utilize local databases, such as SQL 2005 or above or SQL Express, IBM DB2 or other suitable database, to store documents, collections, and data associated with processing such information.
  • the various databases may be a relational database.
  • In relational databases, various tables of data are created and data is inserted into, and/or selected from, these tables using SQL, or some other database-query language known in the art.
  • a database application such as, for example, MySQL™, SQL Server™, Oracle 8i™, 10g™, or some other suitable database application may be used to manage the data.
  • FIG. 2 shows an exemplary representation of a machine in the example form of a computer system 200 within which a set of instructions may be executed to cause the machine to perform any one or more of the methodologies discussed herein.
  • the system 200 may be used to implement the Document Retrieval System 104 of Figure 1 and/or components of that system, e.g.,
  • the machine operates as a standalone device or may be connected (e.g., networked) to other machines.
  • the machine may operate in the capacity of a server or a client machine in server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.
  • the machine may comprise a server computer, a client computer, a personal computer (PC), a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.
  • PC personal computer
  • the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
  • the example computer system 200 includes a processor 202 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 204 and a static memory 506, which communicate with each other via a bus 508.
  • the computer system 200 may further include a video display unit 210, a keyboard or other input device 212, a cursor control device 214 (e.g., a mouse), a storage unit 216 (e.g., hard-disk drive), a signal generation device 218, and a network interface device 220.
  • the storage unit 216 includes a machine-readable medium 222 on which is stored one or more sets of instructions (e.g., software 224) embodying any one or more of the methodologies or functions illustrated herein.
  • the software 224 may also reside, completely or at least partially, within the main memory 204 and/or within the processor 202 during execution thereof by the computer system 200, the main memory 204 and the processor 202 also constituting machine-readable media.
  • the software 224 may further be transmitted or received over a network 226 via the network interface device 220.
  • While the machine-readable medium 222 is shown in an example embodiment to be a single medium, the term "machine-readable medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions.
  • the term "machine-readable medium" shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present invention.
  • the term "machine-readable medium" shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic media, and carrier wave signals.
  • Arg1 Text: "Anyway ZBB's procedures were so cumbersome that everyone involved was crushed under a burden of marginalia;" Arg1 NER: NULL; Arg1 Dependency: ADVMOD POSS NSUBJ COP ADVMOD ROOT COMPLM NSUBJPASS PARTMOD AUXPASS CCOMP DET PREP_UNDER PREP_OF; Arg2 Text: A strategic review is fundamentally different; Arg2 NER: NULL; Arg2 Dependency: DET AMOD NSUBJ COP ADVMOD ROOT.
  • Arg1 Text: But the pool of federal emergency-relief funds already is running low because of the heavy costs of cleaning up Hurricane Hugo and Congress will be under pressure to allocate more money quickly; Arg1 NER: ORGANIZATION; Arg1 Dependency: DET NSUBJ AMOD NN
  • Combination was based on (1) unigram and bigram combined dependencies; (2) bigram dependencies; (3) NER; and (4) unigram and bigram texts. Ultimately, this is a very simple set of features - basically different combinations of text and dependencies. If NER is not included, macro-F1 is 41.08 which still outperforms Pitler et al. (2009) and Zhou et al. (2010) and would represent a favorable drop in feature processing complexity. System Feature Economic meets Zhou et al. (2009) using only (1) combined dependency bigrams; (2) individual dependency unigrams; and (3) text unigrams.
  • COMPARISON.CONCESSION COMPARISON.CONCESSION
  • EXPANSION.ALTERNATIVE COMPARISON.CONCESSION
  • COMPARISON and TEMPORAL relations performed comparatively better in our system at the Class level.
  • EXPANSION had the most positively contributing features at 48%, followed by CONTINGENCY (40%), COMPARISON (33%) and TEMPORAL (21%). Conversely, TEMPORAL had the highest proportion of features that negatively contributed at 73%, followed by COMPARISON (63%), CONTINGENCY (57%) and EXPANSION (49%). For all Class level relations, 2-5% of features did not contribute. [0060] Table 5 indicates that while different Class level features rely on a range of positively contributing features (21-48% of all dimensions), each Class relies on a very similar distribution of those dimensions with, for an individual Class classification, roughly 20% relying on combined dependencies (Comb. Dep.).
  • Stop words appear to play an important role in the other relations as well: EXPANSION - from, has, DET (determiner); CONTINGENCY - you, is, these, that, can, for; and COMPARISON - AUX, DET, CONJUNCTIVE OR.
  • the role of stop words and the contribution in implicit relation prediction has been observed in Marcu and Echihabi (2002) and Blair-Goldensohn et al. (2007) - in particular, that removing them from the corpus hurts performance.
  • Some text features reveal facts about the corpus, but will have weak generalizability. For example, market, investors in CONTINGENCY, mr. in TEMPORAL and rose in EXPANSION.
  • TEXT ARG1 DEP ARG2 prep CDEP abbrev was from TEXT ARG2 is dobj
  • TEXT ARG2 mr TEXT ARG1 has TEXT ARG2 sell mark
  • N is the number of training examples.
  • C is the in-sample error. As N approaches infinity, only C contributes to the error. This is because with an infinite amount of data, everything is in-sample: the limit of E as N approaches infinity is C. So if we can calculate C, we know the theoretical error if we had an infinite amount of data (note that if we get VC wrong, a different a will be learnt, but C will remain the same).
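The error formula itself is not reproduced in this excerpt, so the sketch below assumes a simplified stand-in, E(N) = C + a / sqrt(N), purely to illustrate how an asymptote C can be estimated from learning-curve points. The functional form, the function name and the synthetic data points are all assumptions; they are not the patent's equation or measurements.

```python
import math


def fit_learning_curve(points):
    """Least-squares fit of E = C + a * x with x = 1 / sqrt(N); returns (a, C)."""
    xs = [1.0 / math.sqrt(n) for n, _ in points]
    es = [e for _, e in points]
    mean_x = sum(xs) / len(xs)
    mean_e = sum(es) / len(es)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxe = sum((x - mean_x) * (e - mean_e) for x, e in zip(xs, es))
    a = sxe / sxx
    c = mean_e - a * mean_x  # intercept: the error as N goes to infinity
    return a, c


# Synthetic points generated from known constants, used only to check the fit.
TRUE_A, TRUE_C = 2.0, 0.55
synthetic = [(n, TRUE_C + TRUE_A / math.sqrt(n)) for n in (100, 400, 1600, 6400, 25600)]
a, c = fit_learning_curve(synthetic)
```

Under this assumed form the term a / sqrt(N) vanishes as N grows, so the fitted intercept plays the role of C: the error that remains however much training data is added.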
  • the invention improves performance on a simple and easily implementable feature set for implicit rhetorical relation prediction in the PDTB.
  • the feature engineering in accord with the invention was drastically reduced compared to prior systems and did not require any special processing on the corpus other than running of the dependency parser. Computationally, the system of the invention is very efficient in this respect.
  • inventive concepts may be performed automatically or semi-automatically, i.e., with some degree of human intervention.
  • The present invention is not to be limited in scope by the specific embodiments described herein. It is fully contemplated that other various embodiments of and modifications to the present invention, in addition to those described herein, will become apparent to those of ordinary skill in the art from the foregoing description and accompanying drawings. Thus, such other embodiments and modifications are intended to fall within the scope of the following appended claims.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present invention provides a method and system directed to predicting implicit rhetorical relations between two spans of text, e.g., in a large annotated corpus, such as the Penn Discourse Treebank ("PDTB"), Rhetorical Structure Theory corpus, and the Discourse Graph Bank, and particularly directed to determining a rhetorical relation in the absence of an explicit discourse marker. Surface level features may be used to capture pragmatic information encoded in the absent marker. In one manner a simplified feature set based only on raw text and semantic dependencies is used to improve performance for all relations. By using surface level features to predict implicit rhetorical relations for the large annotated corpus the invention approaches a theoretical maximum performance, suggesting that more data will not necessarily improve performance based on these and similarly situated features.

Description

METHOD AND SYSTEM FOR SIMPLIFYING IMPLICIT RHETORICAL
RELATION PREDICTION IN LARGE SCALE ANNOTATED CORPUS
FIELD OF THE INVENTION
[0001] The present invention relates generally to human language/natural language processing (NLP), information retrieval and more particularly to predicting implicit rhetorical relations between spans of text within documents. Also, the invention relates to processes, software and systems for use in delivery of services related to the legal, corporate, accounting, research, educational, and other professional sectors. The invention relates to a system that presents searching functions to users, such as subscribers to a professional services related service, processes search terms and applies search syntax across document databases, and displays search results generated in response to the search function and processing.
BACKGROUND OF THE INVENTION
[0002] With the advent of the printing press, typeset, typewriting machines, computer-implemented word processing and mass data storage, the amount of information generated by mankind has risen dramatically and with an ever-quickening pace. As a result there is a continuing and growing need to collect and store, identify, track, classify and catalogue for retrieval and distribution this growing sea of information. One traditional form of cataloging and classifying information, e.g., books and other writings, is the Dewey Decimal System. Increasingly, the world's economies and supporting infrastructures, including research systems, are becoming global in nature, and as systems allow for cross-lingual searching, the information available to researchers continues to expand. A growing field of research and development is in the area of extracting relationships and other metadata about documents based on terms or patterns or discerned attributes among documents in large databases. By deriving relationship information, systems can draw conclusions and connections between documents, authors, subjects and events that aid users in researching and other efforts.
[0003] In many areas and industries, including the financial and legal sectors and areas of technology, for example, there are content and enhanced experience providers, such as The Thomson Reuters Corporation. Such providers identify, collect, analyze and process key data for use in generating content, such as law related reports, articles, etc., for consumption by professionals and others involved in the respective industries, e.g., lawyers, accountants, researchers. Providers in the various sectors and industries continually look for products and services to provide subscribers, clients and other customers and for ways to distinguish their firms over the competition. Such providers strive to create and provide enhanced tools, including search and ranking tools, to enable clients to more efficiently and effectively process information and make informed decisions.
[0004] For example, with advancements in technology and sophisticated approaches to searching across vast amounts of data and documents, e.g., database of legal documents or records, published articles or papers, etc., professionals and other users increasingly rely on mathematical models and algorithms in making professional and business determinations. Existing methods for applying search terms across large databases of documents have room for considerable improvement as they frequently do not adequately focus on the key information of interest to yield a focused and well ranked set of documents to most closely match the expressed searching terms and data. Although such computer-based systems have shortcomings, there has been significant advancement over searching, identifying, filtering and grouping documents by hand, which is prohibitively time-intensive, costly, inefficient, and inconsistent.
[0005] Search engines are used to retrieve documents in response to user defined queries or search terms. To this end, search engines may compare the frequency of terms that appear in one document against the frequency of those terms as they appear in other documents within a database or network of databases. This aids the search engine in determining respective "importance" of the different terms within the document, and thus determining the best matching documents to the given query. One method for comparing terms appearing in a document against a collection of documents is called Term Frequency-Inverse Document Frequency (TFIDF or TF-IDF). In this method a percentage of term count as compared to all terms within a subject document is assigned (as a numerator) and that is divided by the logarithm of the percentage of documents in which that term appears in a corpus (as the denominator). More specifically, TFIDF assigns a weight as a statistical measure used to evaluate the importance of a word to a document in a collection of documents or corpus. The relative "importance" of the word increases proportionally to the number of times or "frequency" such word appears in the document. The importance is offset or compared against the frequency of that word appearing in documents comprising the corpus. The inverse document frequency component of TFIDF is expressed as log(N/n(q)), where q is the query term, N is the number of documents in the collection and n(q) is the number of documents containing q. TFIDF and variations of this weighting scheme are typically used by search engines, such as Google, as a way to score and rank a document's relevance given a user query. Generally for each term included in a user query, the document may be ranked in relevance based on summing the scores associated with each term. The documents responsive to the user query may be ranked and presented to the user based on relevancy as well as other determining factors.
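The weighting described in this paragraph can be illustrated with a minimal sketch. It assumes the common product form, term frequency multiplied by log(N/n(q)), and whitespace-tokenized documents; the function names `tf_idf` and `score_query` and the three-document toy corpus are hypothetical and are not taken from any particular search engine.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Score one term for one document.

    tf  = occurrences of the term in the document / tokens in the document
    idf = log(N / n(q)), where N is the number of documents in the corpus
          and n(q) is the number of documents containing the term
    """
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    n_q = sum(1 for doc in corpus if term in doc)
    if n_q == 0:
        return 0.0  # term never appears in the corpus (assumed convention)
    return tf * math.log(len(corpus) / n_q)


def score_query(query_terms, doc_tokens, corpus):
    """Rank signal for a document: the sum of the per-term scores."""
    return sum(tf_idf(term, doc_tokens, corpus) for term in query_terms)


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["court", "ruled", "on", "the", "appeal"],
    ["the", "market", "closed", "higher"],
    ["the", "appeal", "was", "denied"],
]

ranked = sorted(
    range(len(corpus)),
    key=lambda i: score_query(["court", "appeal"], corpus[i], corpus),
    reverse=True,
)
print(ranked)
```

A term such as "the", which appears in every document, receives a weight of zero because log(N/n(q)) is log(1).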
[0006] With regards to training an SVM, Published Pat. App. US2007/0282766 (Hartman et al.) entitled Training a Support Vector Machine With Process Constraints, which is hereby incorporated herein in the entirety, describes a system and method for training a support vector machine (SVM) and particularly a model (primal or dual formulation) implemented with an SVM and representing a plant or process with one or more known attributes. Process constraints that correspond to the known attributes are specified, and the model trained subject to the one or more process constraints. The model includes one or more inputs and one or more outputs, as well as one or more gains, each a respective partial derivative of an output with respect to a respective input. In the manner described, the trained model may be used to control or manage the plant or process.
[0007] More particularly in NLP pursuits, the rhetorical relations that hold between clauses in discourse 1) minimally index temporal and event information, and 2) contribute to a discourse's pragmatic coherence (Andrew Kehler, Coherence, Reference, and the Theory of Grammar, CSLI Publications, Stanford, CA, 2002; Jerry R. Hobbs, On The Coherence and Structure of Discourse, CSLI Technical Report, CSLI-85-37, 1985). From a Natural Language Processing (NLP) perspective, being able to recover the discourse structure of a text has been motivated by the improvement it affords to discourse processing tasks such as natural language generation (Eduard H. Hovy, Automated Discourse Generation Using Discourse Structure Relations, Artificial Intelligence 63, 341-385, 1993) and text summarization (Daniel Marcu, Improving Summarization Through Rhetorical Parsing Tuning, Proceedings of The 6th Workshop on Very Large Corpora, 206-215, 1998). In a 2002 paper, Schilder describes a simple discourse parsing and analysis algorithm that combines a formal under-specification utilizing discourse grammar with Information Retrieval (IR) techniques. Frank Schilder, Robust Discourse Parsing via Discourse Markers, Topicality and Position. Natural Language Engineering, 2002, Vol. 8, Issue 2-3, pages 235-255.
[0008] As described at the http://www.seas.upenn.edu/~pdtb website, the Penn Discourse Treebank (PDTB) is a large scale corpus annotated with information related to discourse structure and discourse semantics. While there are many aspects of discourse that are crucial to a complete understanding of natural language, the PDTB focuses on encoding discourse relations. The annotation methodology follows a lexically-grounded approach. The PDTB has strived to maintain a theory-neutral approach with respect to the nature of high-level representation of discourse structure, in order to allow the corpus to be usable within different theoretical frameworks. Theory-neutrality is achieved by keeping annotations of discourse relations "low-level": each discourse relation is annotated independently of other relations, that is, dependencies across relations are not marked.
[0009] The PDTB is a project aimed at supporting the extraction of a range of inferences associated with discourse relations, for a wide range of NLP applications, such as parsing, information extraction, question-answering, summarization, machine translation, generation, as well as corpus based studies in linguistics and psycholinguistics. The PDTB project also aims to conduct empirical research with the PDTB corpus, for NLP as well as theoretical linguistics. Discourse relations in the current version of the PDTB are taken to be triggered by explicit phrases or by structural adjacency. Each relation is further annotated for its two abstract object arguments, the sense of the relation, and the attributions associated with the relation and each of its two arguments. The annotations in the PDTB are aligned with the syntactic constituency annotations of the Penn Treebank.
[0010] Two documents that describe the PDTB-2.0 corpus and PDTB annotation guidelines, annotation format, and summary distributions are: 1) Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi and Bonnie Webber, The Penn Discourse Treebank 2.0, Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco; and 2) The PDTB Research Group. 2008, The PDTB 2.0. Annotation Manual, Dec. 17, 2007, both available at the http://www.seas.upenn.edu/~pdtb website.
[0011] Focusing on the PDTB, the ability to predict rhetorical relations explicitly cued with a discourse marker (45% of the annotated relations in the PDTB) is very straightforward from a machine learning perspective. For example, Emily Pitler, Mridhula Raghupathy, Hena Mehta, Ani Nenkova, Alan Lee and Aravind Joshi, Easily Identifiable Discourse Relations, Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08), 2008 - Pitler (2008), achieved a 93.09% four-way accuracy based on the explicit marker alone (predicting the four rhetorical relation classes TEMPORAL, EXPANSION, COMPARISON and CONTINGENCY).
Consider (1):
Example (1) a. Pascale finished Fox in Sox.
b. Then she walked to the bookcase to get The Cat in the Hat,
c. which is her favorite book.
d. But the book was too high to reach.
e. So she grabbed Green Eggs and Ham.
[0012] In (1), the NARRATION (or TEMPORAL.SYNCHRONOUS.SUCCESSION in the PDTB) relation holds between the actions in (1a-b) as (1b) follows (1a) at event time. The EXPANSION relation, providing more information about Pascale and The Cat in the Hat, holds between (1b-c). (1c) is temporally inclusive (subordinated) with (1b); there is no temporal progression at event time. The CONTRAST relation (1c-d) is temporally inclusive as well and sets an expectation for a RESULT relation which holds between (1d-e), temporally following the event progression in (1a-b).
[0013] The correspondence of these relations to the explicit discourse markers - e.g., then (1b), which (1c), but (1d) and so (1e) - is both obvious (i.e., part of the pragmatic system of English) and systematic. However, in the absence of an explicit marker, rhetorical relations must be inferred either from the content of clauses themselves (e.g., what is described and how) or some pragmatic phenomenon (e.g., clause position relative to other clauses, variance in specificity of reference, etc.). To illustrate, consider (2):
Example (2) a. Pascale finished Fox in Sox.
b. She walked to the bookcase to get The Cat in the Hat,
c. Her favorite book.
d. The book was too high to reach.
e. She grabbed Green Eggs and Ham.
[0014] If markers are missing, the rhetorical structure (progression of relations) between (1) and (2) is arguably similar and open to wider interpretation, but recoverable. In the PDTB, the ability to predict implicit relations (39% of the annotated relations) has proven to be quite difficult compared to their explicit counterparts. For example, (Emily Pitler, Annie Louis and Ani Nenkova. 2009. Automatic Sense Prediction for Implicit Discourse Relations in Text. In Proceedings of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP-09) 683-691 - Pitler (2009)) and (Zhi-Min Zhou and Yu Xu and Zheng-Yu Niu and Man Lan and Jian Su and Chew Lim Tan. 2010. Predicting Discourse Connectives for Implicit Discourse Relation Recognition. In Proceedings of the 2010 International Conference on Computational Linguistics, Poster Volume, 1507-1514 - Zhou (2010)) achieve between a 36.24 and 40.88 macro-F1 for four rhetorical relation classes based on 10-12 features. This is a significant increase in complexity for mediocre performance.
[0015] The following is background on discourse structure, the PDTB and the current state of implicit relation prediction. There are several different theories of rhetorical relations and the structure of texts (e.g., Discourse Structure Theory (Grosz and Sidner, 1986), Rhetorical Structure Theory ("RST") (Mann and Thompson, 1987) and Segmented Discourse Representation Theory ("SDRT") (Asher and Lascarides, 2003)). Depending on the theory, there can be a range of theoretically informed predetermined relations (e.g., RST contains roughly 30 relations whereas SDRT contains only about 12). However, any given inventory of rhetorical relations covers the same type of pragmatic phenomenon with varying degrees of specificity and generality. For example, RST contains VOLITIONAL and NON-VOLITIONAL CAUSE relations whereas SDRT only has CAUSE. Previous machine learning tasks related to these theories report a wide range of prediction for all target rhetorical relations combined: 49.70% (6-way classifier) (Daniel Marcu and Abdessamad Echihabi. 2002. An Unsupervised Approach to Recognizing Discourse Relations. In Proceedings of the Association of Computational Linguistics (ACL-02) 2002, 368-375 - Marcu (2002)); 57.55% (5-way) (Caroline Sporleder and Alex Lascarides. 2005. Exploiting Linguistic Cues to Classify Rhetorical Relations. In Proceedings of Recent Advances in Natural Language Processing (RANLP-05), 532-539 - Sporleder (2005)); and 70.70% (8-way (sentence internal relations)) (Mirella Lapata and Alex Lascarides. 2004. Inferring Sentence Internal Temporal Relations. In Proceedings of the North American Association of Computational Linguistics (NAACL-04) 2004, 153-160 - Lapata (2004)) and individual relations - e.g., CONTRAST (43.64%); CONDITION (69%) and ELABORATION (82%) (Sporleder (2005)).
[0016] For purposes of describing the background efforts, "rhetorical relations" may be used interchangeably with "sense" (and indicated with SMALL CAPS) as this is the preferred term in the PDTB. The PDTB draws inspiration from the previously mentioned theories of discourse, but does not adopt a specific framework. Rather, the PDTB centrally relies upon the ability of humans to recognize (and agree to) senses whether indexed explicitly with a discourse marker or not (implicit).
[0017] There are over 40 senses assignable in the PDTB which exist in a collapsible hierarchy. At the highest (Class) level, there are 4 senses: TEMPORAL, CONTINGENCY, COMPARISON and EXPANSION. One level down (Type), there are 16 additional senses. At the lowest (Subtype) level, there are 23 additional senses. For sake of space, the full hierarchy is not presented here (see generally, (Prasad et al., 2008)), but the hierarchy is expressed in the sense name as CLASS.TYPE.SUBTYPE. An example PDTB annotation from WSJ_0790 is in Example (3):
Example (3) a. Explicit, but, COMPARISON.CONTRAST
As a critique of middle-class mores, the story is heavy-handed but its unsentimental sketches of Cairo life are vintage Mahfouz
b ....
c. Implicit, because, CONTINGENCY.CAUSE.REASON
The prose is closer to Balzac's "Pere Goriot" than it is to "Arabian Nights" (because) it is Mahfouz began writing when there was no novelistic tradition in Arabic
[0018] In Example (3), each PDTB annotation, which holds between two spans of text (Arg1, Arg2), indicates: whether the relation is Explicit (3a) or Implicit (3c); the actual discourse marker if the relation is explicit or, if it is implicit, an adjudicated marker that captures the relation (because in (3c)); the sense label to its appropriate Class, Type or Subtype level; and the related text spans. Alternative Lexicalizations (AltLex), No Relations (NoRel) and Entity Relations (EntRel) are also annotated in the PDTB but are not considered in this description, as it is assumed that there is always a relation between clauses and that entity relations are part and parcel of the pragmatic determination of the rhetorical relation. The Source, Type, Determinacy and Scopal Polarity attributions of the arguments are also given in the PDTB annotation but are not included in the description herein.
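The fields of an annotation and the collapsible sense hierarchy can be sketched as a small record type. This is an illustrative structure only, not the PDTB file format: the `Annotation` class, the `collapse_sense` helper, and the choice to split Example (3a) into Arg1 and Arg2 at the marker "but" are assumptions made for the sketch.

```python
from dataclasses import dataclass

# Hierarchy levels, from most general to most specific.
LEVELS = ("class", "type", "subtype")


@dataclass(frozen=True)
class Annotation:
    kind: str    # "Explicit" or "Implicit"
    marker: str  # actual marker if explicit, adjudicated marker if implicit
    sense: str   # dotted CLASS.TYPE.SUBTYPE label, possibly shorter
    arg1: str    # first span of text
    arg2: str    # second span of text


def collapse_sense(sense, level="class"):
    """Truncate a dotted sense label to the requested hierarchy level."""
    depth = LEVELS.index(level) + 1
    parts = [part.strip() for part in sense.split(".") if part.strip()]
    return ".".join(parts[:depth])


# Example (3a); the Arg1/Arg2 split at "but" is an assumption.
example_3a = Annotation(
    kind="Explicit",
    marker="but",
    sense="COMPARISON.CONTRAST",
    arg1="As a critique of middle-class mores, the story is heavy-handed",
    arg2="its unsentimental sketches of Cairo life are vintage Mahfouz",
)

print(collapse_sense(example_3a.sense))                     # COMPARISON
print(collapse_sense("CONTINGENCY.CAUSE.REASON", "type"))   # CONTINGENCY.CAUSE
```

Collapsing every label to the Class level is what reduces the task to the four-way classification discussed in the surrounding paragraphs.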
[0019] As mentioned above, Pitler et al. (2008) report results for the four PDTB Class senses and, based solely on the type of explicit marker, achieve a 93.09% four-way accuracy. The fact that there is a highly systematic relationship between discourse markers and the conveyed pragmatic relationship suggests that being able to determine a rhetorical relation in the absence of the marker, i.e., based on the surface content coupled with an individual's ability to draw inferences and make assumptions about discourse structure, is a computationally difficult task.
[0020] Pitler et al.'s (2009) system relies on ten different feature sets: (1) Sentiment polarity tags between spans of text (hereinafter "Arg1" and "Arg2"); (2) "Inquirer" tags from the General Inquirer lexicon (Philip J. Stone and Dexter C. Dunphy and Marshall S. Smith and Daniel M. Ogilvie. 1996. The General Inquirer: A Computer Approach to Content Analysis. MIT Press, Cambridge, Mass. - Stone et al. (1996)) which provides finer grained distinctions for polarity and some semantic and pragmatic characterizations; (3) Reference to money, percentages or numbers - potentially indicating a comparison; (4) Ranked text unigrams and bigrams most likely associated with a given relation from the PDTB implicit training set; (5) Ranked text unigrams and bigrams most likely associated with a given relation from an explicitly marked training set (TextRels corpus (Sasha Blair-Goldensohn and Kathleen R. McKeown and Owen C. Rambow. 2007. Building and Refining Rhetorical-Semantic Relation Models. In Proceedings of NAACL-HLT (NAACL 2007), 428-435 - Blair-Goldensohn et al. (2007))); (6) Verb classifications (Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, IL - Levin (1993)) and their association with different relations; (7) The first and last words of a relation's arguments as well as the first three words (following Ben Wellner and James Pustejovsky and Catherine Havasi and Anna Rumshisky and Roser Sauri. 2006. Classification of Discourse Coherence Relations: An Exploratory Study using Multiple Knowledge Sources. In Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, 117-125 - Wellner et al. (2006)); (8) The presence or absence of a modal verb, specific modal verbs and their cross-products; (9) Whether or not the implicit relation immediately follows or precedes an explicit relation (following Pitler et al. (2008)); and (10) Different variations of word pair models trained on the TextRels, PDTB implicit and explicit training sets - for example, among the word pairs contributing the highest information gain for a given relation, the-but, of-but and to-but strongly associate with COMPARISON, whereas the-and and a-and strongly associate with CONTINGENCY.
[0021] All of these features are designed to get at pragmatic information via surface text and associated semantic information. In four binary classification tasks (i.e., COMPARISON or not, etc.), the best feature combination is the use of first and last words as well as the first three words (Naive Bayes). The macro-F1 for the four binary classifiers based on this feature is 34.23. Individual relation F1s are: COMPARISON=21.01; CONTINGENCY=36.75; EXPANSION=63.22; TEMPORAL=15.93. By adding different combinations of word-pair relations, performance improved for different relations in the binary classification tasks, raising the macro-F1 6% to 40.56.
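The macro-F1 quoted above is the unweighted mean of the four per-relation F1 scores, which the following minimal sketch checks against the reported figures. The helper names `f1` and `macro_f1` are hypothetical; only the four per-relation scores come from the text.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores; every class counts equally."""
    return sum(per_class_f1.values()) / len(per_class_f1)


# Per-relation F1 scores reported for the best single feature set.
reported = {
    "COMPARISON": 21.01,
    "CONTINGENCY": 36.75,
    "EXPANSION": 63.22,
    "TEMPORAL": 15.93,
}

print(round(macro_f1(reported), 2))  # 34.23
```

Because each class is weighted equally, the low TEMPORAL and COMPARISON scores pull the macro-F1 well below the EXPANSION score, regardless of how many instances each relation has.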
[0022] Lin et al. (2009) relies on more consolidated features: (1) Contextual features focused on argument embedding between the previous, current and next arguments; (2) Syntactic constituent parses; (3) Dependency parses (using the Stanford parser (de Marneffe et al., 2006)); and (4) Stemmed word pairs from Arg1 and Arg2 in the PDTB. Both the Class and Type level of relations are predicted using these features. The best individual feature performance (OpenNLP MaxEnt) at the Class level is 30.3-32.9% for the word pairs. Combining all features returns 35.0-40.2% accuracy at the Class level. At the Type level, Lin et al.'s system was able to predict 7 of 11 relations. While the prediction of the 7 of 11 Type relations averages to a 40% micro-average, the macro-F1 is between 20.36. Zhou et al. (2010) use a combination of features from Pitler et al. (2009), Lin et al. (2009) and intra-argument word pairs (Saito et al. (2006)). Zhou et al.'s system makes predictions at the Class level (four linear SVMs from LibSVM (Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2(3), 27:1-27:27 - Chang et al. (2011))). Macro-F1 is similar (40.88): it is 4% better than Pitler et al.'s best single feature classifier (34.23-36.24) and 2% (42.34) better than Pitler et al.'s best combined system (40.56).
[0023] In sum, for predicting implicit relations in the PDTB, the state of the art research returns macro-F1s that top out at a little more than 40% if different feature and classifier performances are combined and mid-30% for single feature set results. Further, all of the features are based on detecting semantic (and some syntactic) information on the assumption that it systematically co-varies with pragmatic rhetorical relations. Like many tasks attempting to predict the same, sensibly relying on the available text shows small incremental improvement over time, but within a window that, overall, runs counter to being able to actually use discourse structure information in downstream NLP tasks (Lin et al., 2009). The next section presents the methodology for our experiments which duplicate (and in some cases exceed) these results with significantly fewer (but higher dimensional) features both in terms of amount and processing effort.
SUMMARY OF THE INVENTION
[0024] To address the shortcomings of existing systems and to satisfy the present and long felt need of the marketplace, the present invention provides a method and system for simplifying rhetorical relation prediction in a large scale annotated corpus or database. More particularly, even if discourse markers are missing, the invention can favorably achieve effective performance for rhetorical relation prediction. In one manner, the rhetorical structure (progression of relations) between Examples (1) and (2) above is arguably similar and open to wider interpretation, but recoverable. Although the invention is described in connection with the PDTB, as it provides a wealth of robustly annotated Wall Street Journal ("WSJ") data and has been the locus of comparative research in this area, the invention is not limited to PDTB. In the PDTB, the ability to predict implicit relations (39% of the annotated relations) has proven to be quite difficult compared to their explicit counterparts. For example, Pitler et al. (2009) and Zhou et al. (2010) achieve between 36.24 and 40.88 macro-F1 for four rhetorical relation classes based on 10-12 features. This is a significant up-tick in complexity for mediocre performance.
[0025] Testing shows that F-score results that are similar to, and in some cases exceed, the current state of the art are actually achievable with a simple set of features - text unigrams and a combined dependency parse. Further, as it pertains to these features for the PDTB and the proposed parameters of the classifier, learning rates suggest that this is as close to the best that can be achieved for this task.
[0026] The invention advances a line of research focused on predicting implicit rhetorical relations between two spans of text, for example in the Penn Discourse Treebank ("PDTB"). Rhetorical relations are a pragmatic feature of texts that are cued very strongly by an explicit discourse marker (e.g., but, when).
However, determining a rhetorical relation in the absence of an explicit discourse marker has proven to be quite difficult. State of the art prediction relies on a myriad of surface level features designed to capture the pragmatic information encoded in the absent marker. However, overall performance only achieves a macro-F1 between 36 and 40% for all relations combined. The invention has demonstrated that using a simplified feature set based only on raw text and semantic dependencies meets or exceeds previous performance by up to 5% for all relations and up to 14% for certain individual relations. Using surface level features to predict implicit rhetorical relations for the PDTB approaches a theoretical maximum performance, suggesting that more data will not necessarily improve performance based on these and similarly situated features.
[0027] In a first embodiment, the invention provides a computer-implemented method for predicting implicit rhetorical relation between spans of text in the absence of an explicit discourse marker, the method represented as instructions stored in memory for recall and processing by a processor such that when executed the method provides a feature vector model comprising a representation of simplified feature set based on raw text and semantic dependencies implemented with a machine learning process, wherein the model comprises one or more inputs and one or more outputs.
The method having: identifying by use of a processor executing a set of code a first factor associated with a first relation and associated with a first span of text Arg1 and a second factor associated with a second relation and associated with a second span of text Arg2; and processing one or more of the following features: (1) sequence expressing the first and second relations as a normalized percentage; (2) text unigram, bigram and/or trigrams of Arg1 and Arg2; (3) unigram, bigram and trigram dependencies of Arg1 and Arg2; and (4) the occurrence of one or more of a date, time, location, person, money, percent, organization named entity.
[0028] In addition, the first embodiment may be further characterized in having one or more of the following additional features: the sequence of the first relation in a four relation discourse is approximately 0.250; the first and second spans of text Arg1 and Arg2 are part of an annotated corpus; the annotated corpus is one of the group consisting of the Penn Discourse Treebank ("PDTB"); Rhetorical Structure Theory corpus; and the Discourse Graph Bank; the annotated corpus is used to train a system to determine classifications; measuring performance relative to the annotated corpus to determine classifier acceptance; applying an accepted classifier to an un-annotated corpus; the first and second spans of text Arg1 and Arg2 are classified with a rhetorical label stored within the annotated corpus; surface level features are used to capture pragmatic information encoded in the absent discourse marker; the one or more features comprises a simplified feature set based only on one or both of raw text and semantic dependencies; the rhetorical relation is represented in a hierarchy comprising one or more levels including one or more of class level, type level and subtype level; each level comprises a set of senses; the one or more levels includes a class level comprising the following set of senses: temporal, contingency, comparison and expansion; and the one or more levels includes a type level comprising a set of senses different from the class level set of senses.
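Two of the listed features, the normalized sequence value and the text unigrams, bigrams and trigrams of Arg1 and Arg2, can be sketched without external tools. The sketch assumes lowercased whitespace tokenization (punctuation stays attached to words) and a flat dictionary as the feature vector; the function names and feature key format are hypothetical. Dependency n-grams and named-entity occurrences would require a parser and an entity tagger and are deliberately left out.

```python
def ngrams(tokens, n):
    """Return the contiguous n-token sequences of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def normalized_position(index, total):
    """1-based position of a relation in its discourse, as a fraction."""
    return index / total


def extract_features(arg1, arg2, index, total, max_n=3):
    """Build a sparse feature dictionary for one Arg1/Arg2 pair.

    Keys look like "arg1:2gram:finished fox"; values are counts, except
    for "sequence", which holds the normalized position of the relation.
    """
    features = {"sequence": normalized_position(index, total)}
    for name, text in (("arg1", arg1), ("arg2", arg2)):
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                key = f"{name}:{n}gram:{gram}"
                features[key] = features.get(key, 0) + 1
    return features


# First relation of a four-relation discourse, using Example (2a-b).
features = extract_features(
    "Pascale finished Fox in Sox.",
    "She walked to the bookcase to get The Cat in the Hat,",
    index=1,
    total=4,
)
print(features["sequence"])  # 0.25
```

The printed value corresponds to the approximately 0.250 sequence value given for the first relation in a four relation discourse.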
[0029] In a second exemplary embodiment, the invention provides a computer-based system for predicting implicit rhetorical relation between spans of text in the absence of an explicit discourse marker, the system comprising a processor, a memory, a user interface and a display. The system further having: a set of instructions stored in the memory and when executed by the processor adapted to provide a feature vector model comprising a representation of simplified feature set based on raw text and semantic dependencies implemented with a machine learning process, wherein the model comprises one or more inputs and one or more outputs; identifying by use of a processor executing a set of code a first factor associated with a first relation and associated with a first span of text Arg1 and a second factor associated with a second relation and associated with a second span of text Arg2; a rhetorical relation module comprising a set of code when executed by the processor adapted to process one or more of the following features: (1) sequence expressing the first and second relations as a normalized percentage; (2) text unigram, bigram and/or trigrams of Arg1 and Arg2; (3) unigram, bigram and trigram dependencies of Arg1 and Arg2; and (4) the occurrence of one or more of a date, time, location, person, money, percent, organization named entity; and an output adapted to generate for display a user interface comprising a representation of the rhetorical relation.
[0030] In a third embodiment, the invention provides a computer-implemented method for predicting implicit rhetorical relation between spans of text in the absence of an explicit discourse marker, the method represented as instructions stored in memory for recall and processing by a processor such that when executed the method provides a feature vector model comprising a representation of simplified feature set based on raw text and semantic dependencies implemented with a machine learning process, wherein the model comprises one or more inputs and one or more outputs. In this embodiment the method includes: generating by use of a processor executing a set of code features relevant for classification including by identifying a first feature associated with a first relation and associated with a first span of text Arg1 and a second feature associated with a second relation and associated with a second span of text Arg2; testing multiple machine learning algorithms against a corpus of training data; measuring performance of the tested machine learning algorithms; selecting a preferred machine learning algorithm; and applying the selected preferred machine learning algorithm to a proprietary corpus.
BRIEF DESCRIPTION OF THE DRAWINGS
[0031] In order to facilitate a full understanding of the present invention, reference is now made to the accompanying drawings, in which like elements are referenced with like numerals. These drawings should not be construed as limiting the present invention, but are intended to be exemplary and for reference.
[0032] Figure 1 is a block diagram illustrating one embodiment of the Rhetorical Relation Analyzer/Predictor implemented in a document retrieval system architecture according to the present invention.
[0033] Figure 2 is a block diagram further illustrating a system architecture for implementing the embodiment of Figure 1.
[0034] Figure 3 is a graphical representation of actual points plotted in a macro-F1 score vs. training instance count graph in connection with the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0035] The present invention will now be described in more detail with reference to exemplary embodiments as shown in the accompanying drawings. While the present invention is described herein with reference to the exemplary embodiments, it should be understood that the present invention is not limited to such exemplary embodiments. Those possessing ordinary skill in the art and having access to the teachings herein will recognize additional implementations, modifications, and embodiments, as well as other applications for use of the invention, which are fully contemplated herein as within the scope of the present invention as disclosed and claimed herein, and with respect to which the present invention could be of significant utility.
[0036] In accordance with the exemplary embodiments described herein, the present invention provides a method and system for simplifying rhetorical relation prediction in a large scale annotated corpus or database. While much is described in the context of PDTB as the exemplary corpus, the invention is not limited to PDTB and may be used with beneficial effect generally with annotated corpora. For example, other annotated corpora include the Rhetorical Structure Theory corpus and the Discourse Graph Bank. These are both academic corpora similar to the PDTB. Ultimately, in keeping with the invention the annotated corpus is used to train a system to figure out good from bad classifications. In addition, one can measure performance relative to the annotated corpus, i.e., how many did the subject classifier get right, how many did it get wrong. Multiple annotated corpora may be used to arrive at the desired features and classifications. Once classifier performance is acceptable relative to the annotated corpus/corpora, the inventive method may be applied to an un-annotated corpus, such as commercial and proprietary corpora, e.g., the Thomson Reuters News Archive.
An additional point to make here is that Arg1 and Arg2 correspond simply to two spans of text. The spans are considered "arguments" and can be sentences or phrases. The PDTB calls them Arg1 and Arg2, but more generally, for other annotated corpora and un-annotated corpora, the method will identify two spans of text and attempt to classify them with the appropriate rhetorical label.
[0037] More particularly, even if discourse markers are missing, the invention can favorably achieve effective performance for rhetorical relation prediction. In one manner, the rhetorical structure (progression of relations) between Examples (1) and (2) above is arguably similar and open to wider interpretation, but recoverable. Although the invention is described in connection with the PDTB, as it provides a wealth of robustly annotated Wall Street Journal ("WSJ") data and has been the locus of comparative research in this area, the invention is not limited to PDTB. In the PDTB, the ability to predict implicit relations (39% of the annotated relations) has proven to be quite difficult compared to their explicit counterparts.
[0038] With reference to Figure 1, the above processes, as discussed in more detail below, may be carried out in conjunction with the combination of hardware and software and communications networking illustrated in the form of exemplary system 100. In this example, system 100 provides a framework for searching, retrieving, analyzing, and ranking claims and/or documents. System 100 may be used in conjunction with a system offering of a professional services provider, e.g., West Services Inc., a part of Thomson Reuters Corporation, and in this example includes a Central Network Server/Database Facility 101 comprising a Network Server 102, a Proprietary Corpora Database, e.g., Thomson Reuters News Archive, 103, a Document Retrieval System 104 having as components a Rhetorical Relations Analyzer (RRA) 105, a Feature Extraction module 106, a Machine Learning Module (e.g., SVM) 107, and a Machine Learning Algorithm Testing/Training Data Module 108.
[0039] Feature Extraction Module 106 creates features relevant for classification. Machine Learning Module 107 includes algorithms and processes for performing any of one or more machine learning approaches/techniques. Although the exemplary embodiments described herein often refer to a support vector machine ("SVM"), the invention is not limited to this approach. For example, and not by way of limitation, in addition to SVM the Machine Learning Module 107 may use or include Naive Bayes and Decision Tree classification algorithms as are well known in the art. Machine Learning Testing/Training Data Module 108 allows the user to test the performance of multiple machine learning algorithms/techniques against one or more corpora or training data. The invention creates features that could, in theory, be used with any machine learning algorithm. In one manner, the invention may be used as follows: (1) create features relevant for classification; (2) test multiple machine learning algorithms against training data, e.g., against a known annotated corpus such as PDTB; (3) measure and record performance of the tested machine learning algorithms; (4) select the preferred machine learning algorithm; and (5) apply the selected preferred machine learning algorithm to a proprietary corpus, e.g., Thomson Reuters News Archive.
[0040] The Central Facility 101 may be accessed by remote users 109, such as via a network 126, e.g., Internet. Aspects of the system 100 may be enabled using any combination of Internet or (World Wide) WEB-based, desktop-based, or application WEB-enabled components. The remote user system 109 in this example includes a GUI interface operated via a computer 110, such as a PC computer or the like, that may comprise a typical combination of hardware and software including, as shown in respect to computer 110, system memory 112, operating system 114, application programs 116, graphical user interface (GUI) 118, processor 120, and storage 122 which may contain electronic information 124 such as electronic documents. The methods and systems of the present invention, described in detail hereafter, may be employed in providing remote users access to a searchable database.
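The five-step use of the invention listed in paragraph [0039] (create features, test several algorithms, record performance, select one, apply it) can be sketched as follows. This is only an illustrative sketch: the toy sentences, labels, and the `compare` helper are hypothetical stand-ins, and a real run would use features extracted from an annotated corpus such as the PDTB rather than raw toy strings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy "annotated corpus" (made up for illustration only).
texts = [
    "sales fell because costs rose",
    "the plant closed because demand dropped",
    "profits grew because exports increased",
    "shares slid because earnings disappointed",
    "prices climbed because supply shrank",
    "the firm expanded because orders surged",
    "the board met then the chairman resigned",
    "the bill passed then the market rallied",
    "the storm hit then congress allocated funds",
    "the deal closed then the shares jumped",
    "the report came out then analysts reacted",
    "the strike ended then production resumed",
]
labels = ["CONTINGENCY"] * 6 + ["TEMPORAL"] * 6

# Step (2): candidate algorithms named in the description.
candidates = {
    "svm": LinearSVC(),
    "naive_bayes": MultinomialNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}


def compare(texts, labels, cv=3):
    """Steps (1)-(3): build features, cross-validate, record macro-F1."""
    scores = {}
    for name, clf in candidates.items():
        pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
        fold_scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1_macro")
        scores[name] = float(fold_scores.mean())
    return scores


scores = compare(texts, labels)
best = max(scores, key=scores.get)  # step (4): select the preferred algorithm

# Step (5): refit the selected algorithm and apply it to unlabeled text.
final_model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), candidates[best])
final_model.fit(texts, labels)
prediction = final_model.predict(["output fell because wages rose"])
```

The selection criterion here (mean macro-F1 over three folds) is an assumption chosen for brevity; any recorded performance measure could drive step (4).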
[0041] In particular, remote users may search a patent document database using search queries based on patent claims to retrieve and view patent documents of interest. Because the volume of patent documents is quite high, the invention provides scoring and ranking processes that facilitate an efficient and highly effective, and much improved, searching and retrieving operation. Client side application software may be stored on a machine-readable medium and comprise instructions executed, for example, by the processor 120 of computer 110, and presentation of web-based interface screens facilitates the interaction between user system 109 and central system 101. The operating system 114 should be suitable for use with the system 101 and browser functionality described herein, for example, Microsoft Windows Vista (business, enterprise and ultimate editions), Windows 7, or Windows XP Professional with appropriate service packs. The system may require the remote user or client machines to be compatible with minimum threshold levels of processing capabilities, e.g., Intel Pentium III, speed, e.g., 500 MHz, minimal memory levels and other parameters.
[0042] The configuration thus described in this example is one of many and is not limiting as to the invention. Central system 101 may include a network of servers, computers and databases, such as over a LAN, WLAN, Ethernet, token ring, FDDI ring or other communications network infrastructure. Any of several suitable communication links are available, such as one or a combination of wireless, LAN, WLAN, ISDN, X.25, DSL, and ATM type networks, for example. Software to perform functions associated with system 101 may include self-contained applications within a desktop or server or network environment and may utilize local databases, such as SQL 2005 or above or SQL Express, IBM DB2 or other suitable database, to store documents, collections, and data associated with processing such information. In the exemplary embodiments the various databases may be relational databases. In the case of relational databases, various tables of data are created and data is inserted into, and/or selected from, these tables using SQL, or some other database-query language known in the art. In the case of a database using tables and SQL, a database application such as, for example, MySQL™, SQLServer™, Oracle 8i™, 10g™, or some other suitable database application may be used to manage the data. These tables may be organized into an RDS or Object Relational Data Schema (ORDS), as is known in the art.
[0043] Now with reference to Figure 2, an exemplary representation is shown of a machine in the example form of a computer system 200 within which a set of instructions may be executed to cause the machine to perform any one or more of the methodologies discussed herein. In particular, the system 200, and variations of this, may be used to implement the Document Retrieval System 104 of Figure 1 and/or components of that system, e.g., Rhetorical Relations Analyzer 105, Feature Extraction Module 106, Machine Learning Algorithm Module 107, and Machine Learning Testing/Training Data Module 108. In alternative embodiments, the machine operates as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine may operate in the capacity of a server or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine may comprise a server computer, a client computer, a personal computer (PC), a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
[0044] The example computer system 200 includes a processor 202 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 204 and a static memory 206, which communicate with each other via a bus 208. The computer system 200 may further include a video display unit 210, a keyboard or other input device 212, a cursor control device 214 (e.g., a mouse), a storage unit 216 (e.g., hard-disk drive), a signal generation device 218, and a network interface device 220.
[0045] The storage unit 216 includes a machine-readable medium 222 on which is stored one or more sets of instructions (e.g., software 224) embodying any one or more of the methodologies or functions illustrated herein. The software 224 may also reside, completely or at least partially, within the main memory 204 and/or within the processor 202 during execution thereof by the computer system 200, the main memory 204 and the processor 202 also constituting machine-readable media. The software 224 may further be transmitted or received over a network 226 via the network interface device 220.
[0046] While the machine-readable medium 222 is shown in an example embodiment to be a single medium, the term "machine-readable medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "machine-readable medium" shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present invention. The term "machine-readable medium" shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic media, and carrier wave signals.
[0047] In accordance with the present invention, 31,748 total relations were extracted from the PDTB. Of the total relations extracted, 16831 (53%) were explicit relations, or "explicits," and 14917 (47%) were implicit relations, or "implicits." The distribution of the implicit data is given in Table 1. The data is predominantly "News" text (12368 - 83%), but other genres are represented as well: "Essays" - 1963 (13%); "Highlights" - 317 (2%); "Letters" - 259 (2%); and "Errata" - 10 (.06%) (Bonnie Webber. 2009. Genre Distinctions for Discourse in the Penn TreeBank. In Proceedings for the Conference of the Association of Computational Linguistics - Webber (2009)). In this process, only singly annotated relations were extracted although multiply annotated relations were also present. For example, the minimum number of relations expected for a given discourse of length k is equal to k-1. This is a flat, backward-looking hierarchy (if forward looking, the total number of relations would be k(k-1)). If non-adjacent clauses are considered, then the maximum number of relations does not exceed the Triangle number T(n), where n=k-1. If a hierarchical structure is considered, the maximum number of relations does not exceed the Catalan number (Schilder, 2002).
Class Distribution
EXPANSION                   8034 (53%)
CONTINGENCY                 3936 (27%)
COMPARISON                  2265 (15%)
TEMPORAL                     682 (5%)
Total                      14917 (100%)

Type Distribution
CONTINGENCY.CAUSE           3935 (26%)
EXPANSION.CONJUNCTION       3123 (21%)
EXPANSION.RESTATEMENT       2995 (20%)
COMPARISON.CONTRAST         1912 (13%)
EXPANSION.INSTANTIATION     1373 (9%)
TEMPORAL.ASYNCHRONOUS        592 (4%)
EXPANSION.LIST               350 (2%)
COMPARISON.CONCESSION        204 (1%)
EXPANSION.ALTERNATIVE        176 (1%)
TEMPORAL.SYNCHRONY            90 (.01%)
CONTINGENCY.PRAGCAUSE         61 (.01%)
Total                      14811 (100%)
Table 1 - Implicit Relation Distribution
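The relation-count bounds stated in paragraph [0047] for a discourse of length k can be checked with a few lines of arithmetic. The function names below are hypothetical, introduced only for illustration; the Catalan-number bound is left out because the description does not say which index applies.

```python
def min_relations(k: int) -> int:
    """Flat, backward-looking case: one relation per adjacent pair, k - 1."""
    return k - 1


def forward_looking_relations(k: int) -> int:
    """Forward-looking count given in the description: k(k - 1)."""
    return k * (k - 1)


def triangle_number(n: int) -> int:
    """T(n) = 1 + 2 + ... + n = n(n + 1) / 2."""
    return n * (n + 1) // 2


def max_nonadjacent_relations(k: int) -> int:
    """Upper bound when non-adjacent clauses are considered: T(k - 1)."""
    return triangle_number(k - 1)


k = 4
bounds = (
    min_relations(k),
    forward_looking_relations(k),
    max_nonadjacent_relations(k),
)
```

For k = 4 the three quantities are 3, 12 and 6 respectively; note that T(k-1) simplifies to k(k-1)/2, exactly half of the forward-looking count.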
[0048] For each relation and associated spans of text Arg1 and Arg2, we developed the following features: (1) sequence - where in the document the relation occurred, expressed as a normalized percentage (e.g., the sequence of the first relation in a four relation discourse would be 0.250); (2) text unigrams, bigrams and trigrams of Arg1 and Arg2; (3) unigram, bigram and trigram dependencies of Arg1 and Arg2, individually and combined, using the Stanford Dependency Parser (see de Marneffe et al. (2006) for a full explanation of dependency node types); and (4) the occurrence of a date, time, location, person, money, percent, or organization named entity, using the Stanford Named Entity Recognizer ("NER") (Jenny Rose Finkel, Trond Grenager and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), 363-370 - Finkel et al. (2005)).
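Features (1) and (2) of paragraph [0048] can be illustrated with a short sketch. The helper names (`sequence_feature`, `ngrams`, `text_features`) and the feature-string format are assumptions made for this example; whitespace tokenization and lowercasing are likewise simplifications, and the dependency and NER features, which require the Stanford tools, are not shown.

```python
from typing import List, Sequence, Tuple


def sequence_feature(position: int, total: int) -> float:
    """Normalized position of a relation in its document (1-indexed)."""
    if not 1 <= position <= total:
        raise ValueError("position must lie between 1 and total")
    return position / total


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def text_features(arg_text: str, prefix: str) -> List[str]:
    """Unigram, bigram and trigram features for one argument span.

    The prefix (e.g. "ARG1" or "ARG2") keeps the two spans' features apart.
    """
    tokens = arg_text.lower().split()
    features = []
    for n in (1, 2, 3):
        features.extend(prefix + "_" + "_".join(gram) for gram in ngrams(tokens, n))
    return features


first_of_four = sequence_feature(1, 4)
arg2_tokens = "A strategic review is fundamentally different".split()
arg2_bigrams = ngrams(arg2_tokens, 2)
short_features = text_features("A strategic review", "ARG2")
```

Here `first_of_four` evaluates to 0.25, matching the example in the text, and the six-token Arg2 span yields five bigrams.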
[0049] For purposes of describing the invention, we examine the following two example feature vectors in the context of exemplary documents "Document ID: wsj_0692," having Relation: COMPARISON.CONTRAST, and "Document ID: wsj_1824," having Relation: TEMPORAL.ASYNCHRONOUS.SUCCESSION. For "Document ID: wsj_0692," the following is known: Arg1 Text: "Anyway ZBB's procedures were so cumbersome that everyone involved was crushed under a burden of marginalia"; Arg1 NER: NULL; Arg1 Dependency: ADVMOD POSS NSUBJ COP ADVMOD ROOT COMPLM NSUBJPASS PARTMOD AUXPASS CCOMP DET PREP_UNDER PREP_OF; Arg2 Text: A strategic review is fundamentally different; Arg2 NER: NULL; Arg2 Dependency: DET AMOD NSUBJ COP ADVMOD ROOT. In this exemplary document we further use the combined dependency from the two spans of text Arg1 and Arg2: ADVMOD POSS NSUBJ COP ADVMOD ROOT COMPLM NSUBJPASS PARTMOD AUXPASS CCOMP DET PREP_UNDER PREP_OF DET AMOD NSUBJ COP ADVMOD RCMOD; and Sequence: 0.8.
[0050] For "Document ID: wsj_1824," having Relation: TEMPORAL.ASYNCHRONOUS.SUCCESSION, we know the following: Arg1 Text: But the pool of federal emergency-relief funds already is running low because of the heavy costs of cleaning up Hurricane Hugo and Congress will be under pressure to allocate more money quickly; Arg1 NER: ORGANIZATION; Arg1 Dependency: DET NSUBJ AMOD NN PREP_OF ADVMOD AUX ROOT ADVMOD DET AMOD PREP_BECAUSE_OF PREPC_OF PRT NN DOBJ NSUBJ AUX CONJ_AND PREP_UNDER AUX XCOMP AMOD DOBJ ADVMOD. Arg2 Text: In Hugo's wake Congress allocated $1.1 billion in relief funds; Arg2 NER: ORGANIZATION, MONEY; Arg2 Dependency: POSS PREP_IN NSUBJ ROOT DOBJ NUMBER NUMBER NN PREP_IN. In this exemplary document we further use the combined dependency from the two spans of text Arg1 and Arg2: DET NSUBJ AMOD NN PREP_OF ADVMOD AUX ROOT ADVMOD DET AMOD PREP_BECAUSE_OF PREPC_OF PRT NN DOBJ NSUBJ AUX CONJ_AND PREP_UNDER AUX XCOMP AMOD DOBJ ADVMOD POSS PREP_IN NSUBJ RCMOD DOBJ NUMBER NUMBER NN PREP_IN; and Sequence: 0.16.
[0051] The Arg1 and Arg2 texts, dependencies and combined dependencies are converted to unigram, bigram and trigram lists (some linearity information (i.e., syntactic) is preserved in the bigram and trigram versions) and are treated as "bags of words." There is not a major difference between Arg1 and Arg2 dependencies and combined dependencies. The most common change is that the ROOT dependency of Arg2 is reassigned as a relative clause modifier (RCMOD), which provides ever so slightly more information than the individual argument dependencies alone.
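The relationship between the individual and combined dependency sequences, and their conversion into "bags" of unigrams, bigrams and trigrams, can be sketched using the wsj_0692 example. Two assumptions are made loudly here: collapsed labels are written with underscores (PREP_UNDER, PREP_OF), and the combined sequence is merely imitated by relabeling Arg2's ROOT as RCMOD, which the text describes as the most common change. The actual system obtains the combined sequence from the dependency parser, so `combine_dependencies` is a hypothetical approximation, not the parser's behavior.

```python
from collections import Counter
from typing import List


def combine_dependencies(arg1_deps: List[str], arg2_deps: List[str]) -> List[str]:
    """Approximate the combined sequence: Arg1 labels, then Arg2 labels
    with ROOT relabeled as RCMOD (an assumption for illustration)."""
    relabeled = ["RCMOD" if dep == "ROOT" else dep for dep in arg2_deps]
    return arg1_deps + relabeled


def bag_of_ngrams(labels: List[str], max_n: int = 3) -> Counter:
    """Count every contiguous n-gram of labels for n = 1..max_n."""
    bag = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(labels) - n + 1):
            bag[" ".join(labels[i:i + n])] += 1
    return bag


arg1 = ("ADVMOD POSS NSUBJ COP ADVMOD ROOT COMPLM NSUBJPASS PARTMOD "
        "AUXPASS CCOMP DET PREP_UNDER PREP_OF").split()
arg2 = "DET AMOD NSUBJ COP ADVMOD ROOT".split()

combined = combine_dependencies(arg1, arg2)
bag = bag_of_ngrams(combined)
```

For this example the relabeling reproduces the combined dependency string listed for wsj_0692, and the bigram "ADVMOD RCMOD" appears only in the combined bag, which is the small amount of extra information the text refers to.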
[0052] With respect to testing, results and comparisons, we ran two experiments: predicting Class level and Type level relations. We report the results using Scikit-Learn's (Pedregosa et al., 2011) LinearSVC (which uses the LIBLINEAR library (Fan et al., 2008)) with tf-idf normalization for each feature set. Similar to Pitler et al. (2009), the system was trained on folders 2-20, and the presented test results are based on the hold-out test set (folders 21 and 22). Specifically, hyper-parameters were found with 10-fold cross-validation. This was done for each combination of features. The hyper-parameters that yielded the lowest cross-validation error in terms of F1 were used to make a model trained on the entire training set to predict the test error via the holdout set.
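The training protocol described above (tf-idf features, LinearSVC, cross-validated hyper-parameter search, then evaluation on a held-out set) can be sketched with scikit-learn as follows. Several things are assumptions for illustration: the toy sentences and labels are invented, three folds are used instead of ten because the toy set is tiny, and the two searched hyper-parameters are mapped to LinearSVC's `C` and `intercept_scaling` arguments, using the four candidate values (.001, .01, .1, 1) that the description gives later. The description does not name the exact arguments it tuned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training set standing in for folders 2-20.
train_texts = [
    "sales fell because costs rose",
    "the plant closed because demand dropped",
    "profits grew because exports increased",
    "shares slid because earnings disappointed",
    "prices climbed because supply shrank",
    "the firm expanded because orders surged",
    "the board met then the chairman resigned",
    "the bill passed then the market rallied",
    "the storm hit then congress allocated funds",
    "the deal closed then the shares jumped",
    "the report came out then analysts reacted",
    "the strike ended then production resumed",
]
train_labels = ["CONTINGENCY"] * 6 + ["TEMPORAL"] * 6

# Hypothetical toy hold-out set standing in for folders 21 and 22.
test_texts = [
    "output fell because wages rose",
    "the merger closed then the shares rallied",
]
test_labels = ["CONTINGENCY", "TEMPORAL"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # tf-idf normalization
    ("svm", LinearSVC()),                             # LIBLINEAR-backed SVM
])

# Assumed mapping of the two tuned hyper-parameters onto LinearSVC arguments.
param_grid = {
    "svm__C": [0.001, 0.01, 0.1, 1],
    "svm__intercept_scaling": [0.001, 0.01, 0.1, 1],
}

# cv=3 only because the toy set is small; the description uses 10 folds.
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=3)
search.fit(train_texts, train_labels)  # refits best setting on all training data

predictions = search.predict(test_texts)
holdout_f1 = f1_score(test_labels, predictions, average="macro")
```

Because `GridSearchCV` refits the best configuration on the full training set by default, a single `fit` call covers both the search and the final model used on the hold-out data.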
[0053] The results reported in Tables 2 and 3 below for the invention are based on the best combination of features ("System Feature Combination") and the best individual relations ("System Feature Subset"); in addition, in the interest of finding the most economical approach, we took the lowest number of features within .01 of the top performing system ("System Feature Economic") (penalizing more features; see, e.g., Akaike (1974)). If we take as a goal the ability to recover discourse structure via rhetorical relations, the focus on the most parsimonious single system output is more appropriate. From an implementation standpoint, running multiple different classifiers to take the best results for any given individual point of classification potentially increases system complexity by a significant margin. However, these results are reported in Tables 2 and 3 for the sake of completeness. We compare against Pitler et al. (2009) and Zhou et al. (2010) at the Class level (comparing F1s) and Lin et al. (2009) at the Type level.
Figure imgf000022_0001
Table 2 - Class Level F1 Results Comparison.
Figure imgf000022_0002
Relation                       Lin et al. (2009)   System   Single Best
Temporal.Asynchronous          13                  22       28
Temporal.Synchrony             0                   0        0
Contingency.Cause              51                  40       45
Contingency.PragmaticCause     0                   4        12
Comparison.Contrast            15                  30       31
Comparison.Concession          0                   2        6
Expansion.Conjunction          38                  30       34
Expansion.Instantiation        49                  42       47
Expansion.Restatement          35                  28       29
Expansion.Alternative          0                   16       23
Expansion.List                 23                  18       23
Macro-F1                       20.36               21.49    25.27
Table 3 - Type Level Results Comparison (Macro-F1).
[0054] At the Class level, the system of the invention outperforms Pitler et al. (2009) and Zhou et al. (2010) on COMPARISON (+0.27% to +10.34%) and TEMPORAL (+9.77% to +10.83%) relations, but not EXPANSION (-8.31% to -5.33%) and CONTINGENCY (-2.86% to +7.53%). Nonetheless, the gains on COMPARISON and TEMPORAL more than make up the difference to achieve top performance on the macro-F1. However, in terms of statistical significance (single-tailed z-test), while our system is significantly better than Pitler et al. (2009) (p=.0003), we have not been able to demonstrate statistical significance over Zhou et al. (2010) (p=.3810). The same result trend holds for picking and choosing the best overall single relation performance from all of the possible classifiers.
[0055] The best single feature combination by our system (System Feature Combination) was based on (1) unigram and bigram combined dependencies; (2) bigram dependencies; (3) NER; and (4) unigram and bigram texts. Ultimately, this is a very simple set of features - basically different combinations of text and dependencies. If NER is not included, macro-F1 is 41.08, which still outperforms Pitler et al. (2009) and Zhou et al. (2010) and would represent a favorable drop in feature processing complexity. System Feature Economic meets Zhou et al. (2010) using only (1) combined dependency bigrams; (2) individual dependency unigrams; and (3) text unigrams.
[0056] At the Type level, our system outperforms that of Lin et al. (2009) by 1.13% for macro-F1. Lin et al. (2009) outperforms our system for EXPANSION.CONJUNCTION, EXPANSION.INSTANTIATION, EXPANSION.RESTATEMENT, EXPANSION.LIST and CONTINGENCY.CAUSE, whereas we outperform on TEMPORAL.ASYNCHRONOUS, CONTINGENCY.PRAGMATICCAUSE, COMPARISON.CONTRAST, COMPARISON.CONCESSION, and EXPANSION.ALTERNATIVE. This makes sense considering that COMPARISON and TEMPORAL relations performed comparatively better in our system at the Class level. However, we are capturing 10 of the 11 Type relations compared to their 7, with fewer and simpler features: dependency unigrams, combined dependency bigrams and text unigrams and bigrams.
[0057] Now we focus the discussion on the results of the system's economic model, the dimensionality of the features used and the learning rate of predicting Class level senses in the PDTB with combined dependency bigrams, argument dependency unigrams and text unigrams.
[0058] With respect to features, we used 10-fold cross-validation (iterating over different combinations of the intercept (I) and regularization (c) hyper-parameters) and GridSearchCV in Scikit-Learn to determine optimal features for the SVM. Four values for each hyper-parameter (.001, .01, .1, 1) were determined from L2 regularization (post L2 normalization). While we report fewer and simpler features overall as compared to previous research, these features do have a high individual dimensionality: combined dependency bigrams = 6390; argument dependency unigrams = 490 (Arg1=287, Arg2=203); and text unigrams = 22191 (Arg1=10658, Arg2=11533). However, it is not the case that all dimensions in the best performing features contribute equally. Table 4 indicates the distribution of positively (+) and negatively (-) contributing and non-contributing (0) features.
Figure imgf000024_0001
Table 4 - Class Level Contributing Feature Distribution.
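A distribution of the kind summarized in Table 4 can be computed from a linear model's weight matrix. The description does not state how "contributing" was measured; the sketch below assumes that a feature contributes positively, negatively, or not at all to a class according to the sign of its learned weight for that class. The function name and the small weight matrix are hypothetical.

```python
import numpy as np


def contribution_distribution(coef, class_names):
    """Percent of positive, negative and zero weights per class row."""
    coef = np.asarray(coef, dtype=float)
    distribution = {}
    for name, row in zip(class_names, coef):
        total = row.size
        distribution[name] = {
            "+": 100.0 * np.count_nonzero(row > 0) / total,
            "-": 100.0 * np.count_nonzero(row < 0) / total,
            "0": 100.0 * np.count_nonzero(row == 0) / total,
        }
    return distribution


# Made-up weights: 2 classes x 4 features (not values from the patent).
toy_coef = [
    [0.5, -0.2, 0.0, 0.1],
    [-0.3, -0.1, 0.4, 0.0],
]
dist = contribution_distribution(toy_coef, ["EXPANSION", "TEMPORAL"])
```

With a fitted scikit-learn `LinearSVC`, the `coef_` attribute (one row per class in the multiclass case) could be passed in place of `toy_coef`.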
[0059] EXPANSION had the most positively contributing features at 48%, followed by CONTINGENCY (40%), COMPARISON (33%) and TEMPORAL (21%). Conversely, TEMPORAL had the highest proportion of features that negatively contributed at 73%, followed by COMPARISON (63%), CONTINGENCY (57%) and EXPANSION (49%). For all Class level relations, 2-5% of features did not contribute.
[0060] Table 5 indicates that while different Class level relations rely on a range of positively contributing features (21-48% of all dimensions), each Class relies on a very similar distribution of those dimensions with, for an individual Class classification, roughly 20% relying on combined dependencies (Comb. Dep.), up to 1% on Arg1 and Arg2 dependencies (Dep.), and about 40% on Arg1 and Arg2 Texts, with CONTINGENCY and EXPANSION requiring slightly more from the Arg2 Text than from the Arg1 Text. However, this observed distribution could also be because there are so many more unigrams than dependencies; overall, about half of the dependencies, but less than half of the text unigrams, are contributing.
Figure imgf000025_0001
Table 5 - Class Level Contributing Feature Type Distribution.
[0061] In Table 6, which focuses on the top 10 features contributing to each Class level relation, we see that the TEMPORAL and CONTINGENCY relations involve more textual features and only a couple of combined dependencies, whereas EXPANSION is a more homogeneous mix and COMPARISON relies exclusively on combined dependencies - in particular, bigrams starting with either an abbreviation modifier (abbrev) or an adjectival complement (acomp). For TEMPORAL, the text unigrams are a combination of stop words (he, was, had, been, in) and temporal adverbs (really, markers) such as when and later. Stop words appear to play an important role in the other relations as well: EXPANSION - from, has, DET (determiner); CONTINGENCY - you, is, these, that, can, for; and COMPARISON - AUX, DET, CONJUNCTIVE OR. The role of stop words and their contribution to implicit relation prediction has been observed in Marcu and Echihabi (2002) and Blair-Goldensohn et al. (2007) - in particular, that removing them from the corpus hurts performance. Some text features reveal facts about the corpus, but will have weak generalizability, for example, market and investors in CONTINGENCY, mr. in TEMPORAL and rose in EXPANSION.
TEMPORAL              EXPANSION              CONTINGENCY           COMPARISON
TEXT_ARG1 he          DEP_ARG2 appos         TEXT_ARG2 market      CDEP abbrev aux
CDEP prep_in num      TEXT_ARG2 even         TEXT_ARG2 you         CDEP abbrev det
TEXT_ARG1 was         DEP_ARG2 prep_from     TEXT_ARG2 is          CDEP abbrev dobj
TEXT_ARG2 when        DEP_ARG2 num           CDEP advcl nn         CDEP abbrev root
TEXT_ARG2 had         CDEP num prep_from     TEXT_ARG1 these       CDEP acomp conj_or
TEXT_ARG2 later       DEP_ARG2 number        TEXT_ARG2 investors   CDEP acomp dep
TEXT_ARG1 named       CDEP ccomp number      TEXT_ARG1 that        CDEP acomp det
TEXT_ARG2 been        TEXT_ARG2 rose         TEXT_ARG2 can         CDEP acomp dobj
TEXT_ARG2 mr          TEXT_ARG1 has          TEXT_ARG2 sell        CDEP acomp mark
CDEP prt det          CDEP det poss          CDEP nn prep_for      CDEP acomp nn
Table 6 - Class Level Top 10 Contributing Features.
[0062] So, it appears, consistent with prior research, that there are indeed textual features that systematically co-occur with different Class relations and that, for all intents and purposes, "approximate" what a discourse marker would do, especially with pairing up associated coarse-grained semantic information. However, with only 40% or so performance, this approximation is comparatively rather weak. Further, while prior state of the art systems rightfully explore ways to increase the approximation by relying on a multitude of complex features designed to boost the effects of the textual features, we argue in the next section that relying on text level features and logical extensions thereof may continue to yield mediocre results because of what can realistically be learned.
[0063] With respect to learning rates, to improve performance, more data could be added to see if prediction accuracy increases; however, prediction accuracy could also suffer. For example, if the explicit data is added to the training set, performance degrades slightly by 1-2 percentage points (observed by Zhou et al. (2010)). While this lower performance could be because the distribution of Class relations is different compared to the implicit data (Expansion - 5722 (34%); Temporal - 2850 (17%); Comparison - 5240 (31%); Contingency - 3018 (18%)), based on a comparison of word and dependency distributions between the implicits and explicits, there is little difference in the nature of the underlying data. This suggests on some level that even if more representative implicit data could be found and annotated similar to the PDTB, performance of Class level implicit rhetorical relation prediction based on surface features and relevant extensions is simply limited by the theoretical nature of the endeavor.
[0064] With reference to Figure 3, consider, by way of illustration, the graphical representation of Macro-F1 Score vs. Training Instance Count in graph 300. To decompose classifier error in an effort to determine if more data would potentially increase performance for the proposed features, classifier and data set, we follow Vapnik (Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY - Vapnik (1995)):
$$E = C + a\sqrt{\frac{VC}{N}}$$
where a is the learning rate, VC is the Vapnik-Chervonenkis dimension of the classifier (Vladimir Vapnik and Alexey Chervonenkis. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2), 264-280 - Vapnik and Chervonenkis (1971)), N is the number of training examples, and C is the in-sample error. As N approaches infinity, the second term vanishes and only C contributes to the error, because with an infinite amount of data everything is in-sample. So if we can calculate C, we know the theoretical error if we had an infinite amount of data. (Note that if we get VC wrong, a different a will be learned, but C will remain the same.)
[0065] To calculate C, we trained on the holdout set and recorded the error. This gave us a set of (E, N) pairs. We let VC = 1 and $k = \sqrt{1/N}$. For each (E, N) pair, we can get an (E, k) pair, in which E is a linear function of k. This allowed us to use ordinary least-squares regression on the set of points (E, k) to find C and a (assuming a normal distribution).
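The regression step just described can be sketched with NumPy. The (N, E) points below are synthetic, generated from invented values of C and a purely so that the fit can be checked; they are not measurements from the patent. The function name `fit_error_decomposition` is likewise hypothetical, and the model assumed is the decomposition E = C + a * sqrt(VC / N) with VC fixed at 1.

```python
import numpy as np


def fit_error_decomposition(n_train, errors, vc=1.0):
    """Ordinary least-squares fit of E = C + a * k, with k = sqrt(vc / N).

    Returns (C, a): C estimates the error remaining with unlimited data,
    a is the slope on k.
    """
    n_train = np.asarray(n_train, dtype=float)
    errors = np.asarray(errors, dtype=float)
    k = np.sqrt(vc / n_train)
    # Design matrix with an intercept column (for C) and k (for a).
    design = np.column_stack([np.ones_like(k), k])
    (c, a), *_ = np.linalg.lstsq(design, errors, rcond=None)
    return float(c), float(a)


# Synthetic, noise-free points from invented parameters (illustration only).
true_c, true_a = 0.58, 2.0
sizes = np.array([500, 1000, 2000, 4000, 8000])
observed = true_c + true_a * np.sqrt(1.0 / sizes)

c_hat, a_hat = fit_error_decomposition(sizes, observed)
```

On noise-free synthetic data the fit recovers the generating values; with real (E, N) pairs the intercept `c_hat` plays the role of the asymptotic error the text refers to.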
[0066] As indicated in graph 300 of Figure 3, the theoretical limit, shown by bar 302 and in this instance as given by (4), is a Macro-F1 of 41.30, indicating that the invention is essentially at maximum performance and that having more data would not be beneficial using the same or similar features (text unigrams, dependency unigrams and combined dependency bigrams) for the economic system classifier on the PDTB. However, given the nature of the features, it is possible to extrapolate that similar limits to performance will be found for this task on the PDTB.
[0067] In sum, the explicit marker is influential in cuing discourse structure in English - it is the best possible information. When it is absent, we may retrieve it from text and by using associated semantics. However, as this research indicates along with an evaluation of in-sample error decomposition, the ability to do this is limited. This reality is in step with underlying theories of pragmatics and discourse structure. For example, there is something odd about having a discourse marker at the beginning of every clause (e.g., potentially violates Grice's manner maxim), so we expect natural language discourses in English to have a fair share of implicit markers, but it is not the case that the understandability of the discourse structure hopelessly breaks down in the absence of a marker. Human inter-annotator agreement "ceiling" for the PDTB for explicit and implicit relations combined is 94% for Class, 84% for Type and 80% for Subtype (Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi and Bonnie Webber. 2008. The Penn Discourse TreeBank 2.0. In Proceedings of the International Conference on Language Resources and Evaluation (LREC-08) - Prasad et al. (2008)). Therefore, pushing this research forward will require the annotation and surface level association with some type of interpretive assumptions at document level.
[0068] The invention improves performance on a simple and easily implementable feature set for implicit rhetorical relation prediction in the PDTB. The feature engineering in accord with the invention was drastically reduced compared to prior systems and did not require any special processing on the corpus other than running of the dependency parser. Computationally, the system of the invention is very efficient in this respect.
[0069] In implementation, the inventive concepts may be automatically or semi-automatically, i.e., with some degree of human intervention, performed. Also, the present invention is not to be limited in scope by the specific embodiments described herein. It is fully contemplated that other various embodiments of and modifications to the present invention, in addition to those described herein, will become apparent to those of ordinary skill in the art from the foregoing description and accompanying drawings. Thus, such other embodiments and modifications are intended to fall within the scope of the following appended claims. Further, although the present invention has been described herein in the context of particular embodiments and implementations and applications and in particular environments, those of ordinary skill in the art will appreciate that its usefulness is not limited thereto and that the present invention can be beneficially applied in any number of ways and environments for any number of purposes. Accordingly, the claims set forth below should be construed in view of the full breadth and spirit of the present invention as disclosed herein.

Claims

WE CLAIM
1. A computer-implemented method for predicting implicit rhetorical relation between spans of text in the absence of an explicit discourse marker, the method represented as instructions stored in memory for recall and processing by a processor such that when executed the method provides a feature vector model comprising a representation of simplified feature set based on raw text and semantic dependencies implemented with a machine learning process, wherein the model comprises one or more inputs and one or more outputs, the method comprising: a. identifying by use of a processor executing a set of code a first factor associated with a first relation and associated with a first span of text Arg1 and a second factor associated with a second relation and associated with a second span of text Arg2; and b. processing one or more of the following features: (1) sequence expressing the first and second relations as a normalized percentage; (2) text unigram, bigram and/or trigrams of Arg1 and Arg2; (3) unigram, bigram and trigram dependencies of Arg1 and Arg2; and (4) the occurrence of one or more of a date, time, location, person, money, percent, organization named entity.
2. The computer-implemented method of claim 1, wherein the sequence of the first relation in a four relation discourse is approximately 0.250.
3. The computer-implemented method of claim 1, wherein the first and second spans of text Arg1 and Arg2 are part of an annotated corpus.
4. The computer-implemented method of claim 3, wherein the annotated corpus is one of the group consisting of the Penn Discourse Treebank ("PDTB"); Rhetorical Structure Theory corpus; and the Discourse Graph Bank.
5. The computer-implemented method of claim 3, wherein the annotated corpus is used to train a system to determine classifications.
6. The computer-implemented method of claim 3, further comprising measuring performance relative to the annotated corpus to determine classifier acceptance.
7. The computer-implemented method of claim 6, further comprising applying an accepted classifier to an un-annotated corpus.
8. The computer-implemented method of claim 2, wherein the first and second spans of text Arg1 and Arg2 are classified with a rhetorical label stored within the annotated corpus.
9. The computer-implemented method of claim 1, wherein surface level features are used to capture pragmatic information encoded in the absent discourse marker.
10. The computer-implemented method of claim 1, wherein the one or more features comprises a simplified feature set based only on one or both of raw text and semantic dependencies.
11. The computer-implemented method of claim 1, wherein the rhetorical relation is represented in a hierarchy comprising one or more levels including one or more of class level, type level and subtype level.
12. The computer-implemented method of claim 11, wherein each level comprises a set of senses.
13. The computer-implemented method of claim 11, wherein the one or more levels includes a class level comprising the following set of senses: temporal, contingency, comparison and expansion, and includes a type level comprising a set of senses different from the class level set of senses.
14. The computer-implemented method of claim 1, wherein the machine learning process includes one or more of: support vector machine (SVM), Naïve Bayes, and Decision Tree classification algorithms.
15. A computer-based system for predicting implicit rhetorical relation between spans of text in the absence of an explicit discourse marker, the system comprising a processor, a memory, a user interface and a display, the system further comprising: a. a set of instructions stored in the memory and when executed by the processor adapted to provide a feature vector model comprising a representation of simplified feature set based on raw text and semantic dependencies implemented with a machine learning process, wherein the model comprises one or more inputs and one or more outputs; b. identifying by use of a processor executing a set of code a first factor associated with a first relation and associated with a first span of text Arg1 and a second factor associated with a second relation and associated with a second span of text Arg2; c. a rhetorical relation module comprising a set of code when executed by the processor adapted to process one or more of the following features: (1) sequence expressing the first and second relations as a normalized percentage; (2) text unigram, bigram and/or trigrams of Arg1 and Arg2; (3) unigram, bigram and trigram dependencies of Arg1 and Arg2; and (4) the occurrence of one or more of a date, time, location, person, money, percent, organization named entity; and d. an output adapted to generate for display a user interface comprising a representation of the rhetorical relation.
16. The computer-based system of claim 15, wherein the first and second spans of text Arg1 and Arg2 are obtained from a corpus of documents.
17. The computer-based system of claim 16, wherein the corpus of documents is annotated.
18. The computer-based system of claim 17, wherein the corpus of documents is one of the group consisting of the Penn Discourse Treebank ("PDTB"); Rhetorical Structure Theory corpus; and the Discourse Graph Bank.
19. The computer-based system of claim 15, wherein surface level features are used to capture pragmatic information encoded in the absent discourse marker.
20. The computer-based system of claim 15, wherein the one or more features comprises a simplified feature set based only on one or both of raw text and semantic dependencies.
21. The computer-based system of claim 15, wherein the rhetorical relation is represented in a hierarchy comprising one or more levels including one or more of class level, type level and subtype level.
22. The computer-based system of claim 21, wherein each level comprises a set of senses.
23. The computer-based system of claim 22, wherein the one or more levels includes a class level comprising the following set of senses: temporal, contingency, comparison and expansion, and a type level comprising a set of senses different from the class level set of senses.
24. The computer-based system of claim 15, wherein the machine learning process includes one or more of support vector machine (SVM), Naïve Bayes, and Decision Tree classification algorithms.
25. A computer-implemented method for predicting implicit rhetorical relation between spans of text in the absence of an explicit discourse marker, the method represented as instructions stored in memory for recall and processing by a processor such that when executed the method provides a feature vector model comprising a representation of simplified feature set based on raw text and semantic dependencies implemented with a machine learning process, wherein the model comprises one or more inputs and one or more outputs, the method comprising: a. generating by use of a processor executing a set of code features relevant for classification including by identifying a first feature associated with a first relation and associated with a first span of text Arg1 and a second feature associated with a second relation and associated with a second span of text Arg2; b. testing multiple machine learning algorithms against a corpus of training data; c. measuring performance of the tested machine learning algorithms; d. selecting a preferred machine learning algorithm; and e. applying the selected preferred machine learning algorithm to a proprietary corpus.
26. The computer-implemented method of claim 25, wherein the corpus of training data is an annotated corpus.
27. The computer-implemented method of claim 26, wherein the corpus of documents is one of the group consisting of the Penn Discourse Treebank ("PDTB"); Rhetorical Structure Theory corpus; and the Discourse Graph Bank.
28. The computer-implemented method of claim 25, wherein the machine learning process includes one or more of support vector machine (SVM), Naïve Bayes, and Decision Tree classification algorithms.
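
For illustration only, and not as part of the claims, the feature processing recited in claim 1 and the algorithm selection recited in claim 25 might be sketched as follows. This is a minimal sketch under stated assumptions: the function names (`build_features`, `select_classifier`), whitespace tokenization, the i/n normalization of the sequence feature (chosen so that the first relation of four gives 0.250, as in claim 2), and the use of accuracy as the performance measure are all hypothetical choices not taken from the specification. Dependency n-grams (feature 3) are omitted because they require an external parser, and named entity types are assumed to be supplied by an external recognizer.

```python
from collections import Counter

# Named entity types listed in feature (4) of claim 1.
NE_TYPES = ("date", "time", "location", "person", "money", "percent", "organization")


def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sequence_feature(position, total):
    """Assumed normalization: 1-based position of the relation divided by
    the number of relations in the discourse (first of four -> 0.25)."""
    return position / total


def build_features(arg1, arg2, position, total, entity_types=()):
    """Build a sparse feature dict for one Arg1/Arg2 pair.

    entity_types is assumed to come from an external named entity
    recognizer; this sketch does not perform recognition itself.
    """
    feats = {"sequence": sequence_feature(position, total)}
    for name, text in (("arg1", arg1), ("arg2", arg2)):
        tokens = text.lower().split()  # assumption: whitespace tokenization
        for n in (1, 2, 3):
            for gram, count in Counter(ngrams(tokens, n)).items():
                feats[f"{name}_{n}gram={gram}"] = float(count)
    found = set(entity_types)
    for ne in NE_TYPES:
        feats[f"ne_{ne}"] = 1.0 if ne in found else 0.0
    return feats


def select_classifier(candidates, train, held_out):
    """Fit each candidate on train, score it on held_out, return the best.

    candidates maps a name to a fit function; each fit function takes the
    training pairs and returns a predict function. Accuracy is an assumed
    performance measure.
    """
    scores = {}
    for name, fit in candidates.items():
        predict = fit(train)
        correct = sum(predict(x) == y for x, y in held_out)
        scores[name] = correct / len(held_out)
    best = max(scores, key=scores.get)
    return best, scores
```

In such a sketch, the fit functions passed to `select_classifier` could wrap SVM, Naïve Bayes or Decision Tree implementations from any machine learning library, and the selected classifier would then be applied to an un-annotated corpus.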
PCT/US2014/045432 2013-07-03 2014-07-03 Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus WO2015003143A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU2014285073A AU2014285073B9 (en) 2013-07-03 2014-07-03 Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus
CA2917153A CA2917153C (en) 2013-07-03 2014-07-03 Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201361842635P 2013-07-03 2013-07-03
US61/842,635 2013-07-03

Publications (2)

Publication Number Publication Date
WO2015003143A2 WO2015003143A2 (en) 2015-01-08
WO2015003143A3 WO2015003143A3 (en) 2015-05-14

Family

ID=52144292

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2014/045432 WO2015003143A2 (en) 2013-07-03 2014-07-03 Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus

Country Status (3)

Country Link
AU (1) AU2014285073B9 (en)
CA (1) CA2917153C (en)
WO (1) WO2015003143A2 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111149100A (en) * 2017-09-28 2020-05-12 甲骨文国际公司 Determining thesaurus interrelationships across documents based on named entity parsing and recognition
CN113377915A (en) * 2021-06-22 2021-09-10 厦门大学 Dialogue chapter analysis method
CN113407713A (en) * 2020-10-22 2021-09-17 腾讯科技(深圳)有限公司 Corpus mining method and apparatus based on active learning and electronic device
CN113535973A (en) * 2021-06-07 2021-10-22 中国科学院软件研究所 Event relation extraction and language sentence relation analysis method and device based on knowledge mapping
CN113553830A (en) * 2021-08-11 2021-10-26 桂林电子科技大学 Graph-based English text sentence language piece coherent analysis method
US11782985B2 (en) 2018-05-09 2023-10-10 Oracle International Corporation Constructing imaginary discourse trees to improve answering convergent questions
US11809825B2 (en) 2017-09-28 2023-11-07 Oracle International Corporation Management of a focused information sharing dialogue based on discourse trees
US11880652B2 (en) 2019-11-14 2024-01-23 Oracle International Corporation Detecting hypocrisy in text

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209366B (en) * 2019-10-10 2023-04-21 天津大学 Implicit chapter relation recognition method of mutual excitation neural network based on TransS driving
CN112257460B (en) * 2020-09-25 2022-06-21 昆明理工大学 Pivot-based Hanyue combined training neural machine translation method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5659766A (en) * 1994-09-16 1997-08-19 Xerox Corporation Method and apparatus for inferring the topical content of a document based upon its lexical content without supervision
AU2001261506A1 (en) * 2000-05-11 2001-11-20 University Of Southern California Discourse parsing and summarization
US7062561B1 (en) * 2000-05-23 2006-06-13 Richard Reisman Method and apparatus for utilizing the social usage learned from multi-user feedback to improve resource identity signifier mapping
US7127208B2 (en) * 2002-01-23 2006-10-24 Educational Testing Service Automated annotation
US7305336B2 (en) * 2002-08-30 2007-12-04 Fuji Xerox Co., Ltd. System and method for summarization combining natural language generation with structural analysis

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111149100B (en) * 2017-09-28 2023-08-29 甲骨文国际公司 Determining a paraphrasing interrelationship across documents based on resolution and identification of named entities
CN111149100A (en) * 2017-09-28 2020-05-12 甲骨文国际公司 Determining thesaurus interrelationships across documents based on named entity parsing and recognition
US11809825B2 (en) 2017-09-28 2023-11-07 Oracle International Corporation Management of a focused information sharing dialogue based on discourse trees
US11797773B2 (en) 2017-09-28 2023-10-24 Oracle International Corporation Navigating electronic documents using domain discourse trees
US11782985B2 (en) 2018-05-09 2023-10-10 Oracle International Corporation Constructing imaginary discourse trees to improve answering convergent questions
US11880652B2 (en) 2019-11-14 2024-01-23 Oracle International Corporation Detecting hypocrisy in text
CN113407713B (en) * 2020-10-22 2024-04-05 腾讯科技(深圳)有限公司 Corpus mining method and device based on active learning and electronic equipment
CN113407713A (en) * 2020-10-22 2021-09-17 腾讯科技(深圳)有限公司 Corpus mining method and apparatus based on active learning and electronic device
CN113535973A (en) * 2021-06-07 2021-10-22 中国科学院软件研究所 Event relation extraction and language sentence relation analysis method and device based on knowledge mapping
CN113535973B (en) * 2021-06-07 2023-06-23 中国科学院软件研究所 Event relation extraction and language-to-language relation analysis method and device based on knowledge mapping
CN113377915B (en) * 2021-06-22 2022-07-19 厦门大学 Dialogue chapter analysis method
CN113377915A (en) * 2021-06-22 2021-09-10 厦门大学 Dialogue chapter analysis method
CN113553830B (en) * 2021-08-11 2023-01-03 桂林电子科技大学 Graph-based English text sentence language piece coherent analysis method
CN113553830A (en) * 2021-08-11 2021-10-26 桂林电子科技大学 Graph-based English text sentence language piece coherent analysis method

Also Published As

Publication number Publication date
AU2014285073B9 (en) 2017-04-06
WO2015003143A3 (en) 2015-05-14
CA2917153A1 (en) 2015-01-08
AU2014285073B2 (en) 2016-11-03
AU2014285073A1 (en) 2016-02-04
CA2917153C (en) 2022-05-17

Similar Documents

Publication Publication Date Title
US9355372B2 (en) Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus
AU2014285073B9 (en) Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus
US9317498B2 (en) Systems and methods for generating summaries of documents
Yi et al. Sentiment mining in WebFountain
Yeh et al. Text summarization using a trainable summarizer and latent semantic analysis
Chali et al. Query-focused multi-document summarization: Automatic data annotations and supervised learning approaches
Chen et al. Towards robust unsupervised personal name disambiguation
CN107967290A (en) A kind of knowledge mapping network establishing method and system, medium based on magnanimity scientific research data
Yalcin et al. An external plagiarism detection system based on part-of-speech (POS) tag n-grams and word embedding
Zhang et al. Enhancing keyphrase extraction from academic articles with their reference information
Devi et al. A hybrid document features extraction with clustering based classification framework on large document sets
Zhang et al. The use of dependency relation graph to enhance the term weighting in question retrieval
Jaradat et al. Hybrid-based Arabic single-document text summarization approach using genatic algorithm
Laddha et al. Aspect opinion expression and rating prediction via LDA–CRF hybrid
Fagan et al. An introduction to textual econometrics
Sun et al. Identifying, indexing, and ranking chemical formulae and chemical names in digital documents
You et al. Joint learning-based heterogeneous graph attention network for timeline summarization
Sharma et al. Diverse feature set based Keyphrase extraction and indexing techniques
Zhou et al. Semantic Smoothing of Document Models for Agglomerative Clustering.
Mason An n-gram based approach to the automatic classification of web pages by genre
Tahmasebi Models and algorithms for automatic detection of language evolution: towards finding and interpreting of content in long-term archives
Brand et al. N-gram representations for comment filtering
Xu et al. Exploiting paper contents and citation links to identify and characterise specialisations
Sizov Extraction-based automatic summarization: Theoretical and empirical investigation of summarization techniques
Ceylan Investigating the extractive summarization of literary novels

Legal Events

Date Code Title Description
ENP Entry into the national phase. Ref document number: 2917153; Country of ref document: CA
NENP Non-entry into the national phase. Ref country code: DE
ENP Entry into the national phase. Ref document number: 2014285073; Country of ref document: AU; Date of ref document: 20140703; Kind code of ref document: A
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 14820158; Country of ref document: EP; Kind code of ref document: A2
122 Ep: pct application non-entry in european phase. Ref document number: 14820158; Country of ref document: EP; Kind code of ref document: A2