CA2363017A1 - Multi-document summarization system and method - Google Patents

Info

Publication number
CA2363017A1
CA2363017A1 CA002363017A CA2363017A CA2363017A1 CA 2363017 A1 CA2363017 A1 CA 2363017A1 CA 002363017 A CA002363017 A CA 002363017A CA 2363017 A CA2363017 A CA 2363017A CA 2363017 A1 CA2363017 A1 CA 2363017A1
Authority
CA
Canada
Prior art keywords
phrases
nodes
phrase
temporal
root nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CA002363017A
Other languages
French (fr)
Other versions
CA2363017C (en)
Inventor
Kathleen R. McKeown
Regina Barzilay
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Columbia University in the City of New York
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
1999-02-19
Filing date
2000-02-18
Publication date
2000-08-24
Application filed by Individual
Publication of CA2363017A1
Application granted
Publication of CA2363017C
Anticipated expiration
Expired - Fee Related (current legal status)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation

Landscapes

  • Engineering & Computer Science
  • Theoretical Computer Science
  • Physics & Mathematics
  • General Engineering & Computer Science
  • General Physics & Mathematics
  • Audiology, Speech & Language Pathology
  • Computational Linguistics
  • General Health & Medical Sciences
  • Health & Medical Sciences
  • Artificial Intelligence
  • Data Mining & Analysis
  • Databases & Information Systems
  • Information Retrieval, Db Structures And Fs Structures Therefor
  • Machine Translation
  • Document Processing Apparatus

Abstract

A summary for a collection of related documents can be generated by extracting phrases (100) from the documents which include common focus elements. Phrase intersection analysis (120) is then performed on the extracted phrases (100) to generate a phrase intersection table, where identical or equivalent phrases are identified. Temporal processing (140) on the phrases in the phrase intersection table is performed to remove ambiguous time references and to sort the phrases in a temporal sequence. Sentence generation is then used to combine the phrases in the phrase intersection table into a coherent summary.
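
As an illustration of the phrase intersection step, the following is a minimal Python sketch and not the patented implementation. It follows the outline given in claim 2: phrases are represented as trees, trees with verb root nodes are selected, root nodes are compared for identity or equivalence, and the children of matching roots are then evaluated. The `Node` class, the `SYNONYMS` table, and all function names are hypothetical. The sketch implements only a toy version of one paraphrasing rule (semantically related words), requires matching parts of speech, and does not build DSYNT trees.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a simplified phrase tree (hypothetical, not a DSYNT node)."""
    lemma: str
    pos: str  # part of speech, e.g. "verb" or "noun"
    children: List["Node"] = field(default_factory=list)


# Assumed stand-in for the "semantically related words" paraphrasing rule.
SYNONYMS = {frozenset({"attack", "strike"})}


def equivalent(a: Node, b: Node) -> bool:
    """True if two nodes are identical or related by the toy paraphrasing rule."""
    if a.pos != b.pos:
        return False
    return a.lemma == b.lemma or frozenset({a.lemma, b.lemma}) in SYNONYMS


def shared_children(a: Node, b: Node):
    """Pairs of child lemmas that are identical or equivalent."""
    return [(x.lemma, y.lemma)
            for x in a.children for y in b.children if equivalent(x, y)]


def phrase_intersection(trees):
    """Build a small phrase intersection table from trees with verb roots."""
    verb_trees = [t for t in trees if t.pos == "verb"]
    table = []
    for i, a in enumerate(verb_trees):
        for b in verb_trees[i + 1:]:
            if equivalent(a, b):
                # Children are evaluated only when the roots already match.
                table.append({"roots": (a.lemma, b.lemma),
                              "shared": shared_children(a, b)})
    return table


t1 = Node("attack", "verb", [Node("rebel", "noun"), Node("village", "noun")])
t2 = Node("strike", "verb", [Node("rebel", "noun"), Node("town", "noun")])
t3 = Node("report", "noun")
print(phrase_intersection([t1, t2, t3]))
# [{'roots': ('attack', 'strike'), 'shared': [('rebel', 'rebel')]}]
```

Here the noun-rooted tree is skipped, the two verb roots are treated as equivalent through the synonym table, and only the shared child "rebel" enters the table. A fuller version would need one predicate per paraphrasing rule listed in claim 4.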

Claims (22)

1. A method for generating a summary of a plurality of related documents in a collection comprising:
extracting phrases having focus elements from the plurality of documents;
performing phrase intersection analysis on the extracted phrases to generate a phrase intersection table;
performing temporal processing on the phrases in the phrase intersection table; and
performing sentence generation using the phrases in the phrase intersection table.
2. The method of generating a summary as defined by claim 1, wherein the phrase intersection analysis comprises:
representing the phrases in tree structures having root nodes and children nodes;
selecting those tree structures with verb root nodes;
comparing the selected root nodes to the other root nodes to identify identical nodes;
applying paraphrasing rules to non-identical root nodes to determine if non-identical nodes are equivalent; and
evaluating the children nodes of those tree structures where the parent nodes are identical or equivalent.
3. The method of claim 2, wherein the tree structure is a DSYNT tree structure.
4. The method of claim 2, wherein the paraphrasing rules are selected from the group consisting of ordering of sentence components, main clause versus a relative clause, different syntactic categories, change in grammatical features, omission of an empty head, transformation of one part of speech to another, and semantically related words.
5. The method of claim 1, wherein the temporal processing includes:
time stamping phrases based on a first occurrence of the phrase in the collection;
substituting date certain references for ambiguous temporal references;
ordering the phrases based on the time stamp; and
inserting a temporal marker if a temporal gap between phrases exceeds a threshold value.
6. The method of claim 1, further comprising a phrase divergence processing operation.
7. The method of claim 1, wherein the sentence generation includes mapping phrases to an input format of a language generation engine and operating the language generation engine.
8. A system for generating a summary of a plurality of related documents in a collection comprising:
a storage device for storing the documents in the collection;
a lexical database; and
a processing subsystem, the processing subsystem being operatively coupled to the storage device and the lexical database, the processing subsystem being programmed to access the documents in the storage device and:
using the lexical database to extract phrases having focus elements from the plurality of documents;
performing phrase intersection analysis on the extracted phrases to generate a phrase intersection table;
performing temporal processing on the phrases in the phrase intersection table; and
performing sentence generation using the phrases in the phrase intersection table.
9. The system for generating a summary as defined by claim 8, wherein the phrase intersection analysis processing further comprises:
representing the phrases as data structures having root nodes and children nodes;
selecting those data structures with verb root nodes;
comparing the selected root nodes to the other root nodes to identify identical nodes;
applying paraphrasing rules to non-identical root nodes to determine if non-identical nodes are equivalent; and
evaluating the children nodes of those tree structures where the parent nodes are identical or equivalent.
10. The system of claim 9, wherein the data structure is a DSYNT tree structure.
11. The system of claim 9, wherein the paraphrasing rules are selected from the group consisting of ordering of sentence components, main clause versus a relative clause, different syntactic categories, change in grammatical features, omission of an empty head, transformation of one part of speech to another, and semantically related words.
12. The system of claim 8, wherein the temporal processing includes:
time stamping phrases based on a first occurrence of the phrase in the collection;
substituting date certain references for ambiguous temporal references;
ordering the phrases based on the time stamp; and
inserting a temporal marker if a temporal gap between phrases exceeds a threshold value.
13. The system of claim 8, further comprising a phrase divergence processing operation.
14. The system of claim 8, wherein the processing subsystem includes a language generation engine and wherein sentence generation includes mapping phrases to an input format of the language generation engine and then operating the language generation engine.
15. The system of claim 8, wherein the storage device for storing the documents in the collection is remotely located from the processing subsystem.
16. A computer readable media for programming a computer system to perform a method of generating a summary of a plurality of related documents in a collection comprising:
extracting phrases having focus elements from the plurality of documents;
performing phrase intersection analysis on the extracted phrases to generate a phrase intersection table;
performing temporal processing on the phrases in the phrase intersection table; and
performing sentence generation using the phrases in the phrase intersection table.
17. The computer readable media of claim 16, wherein the phrase intersection analysis comprises:
representing the phrases in tree structures having root nodes and children nodes;
selecting those tree structures with verb root nodes;
comparing the selected root nodes to the other root nodes to identify identical nodes;
applying paraphrasing rules to non-identical root nodes to determine if non-identical nodes are equivalent; and
evaluating the children nodes of those tree structures where the parent nodes are identical or equivalent.
18. The computer readable media of claim 17, wherein the tree structure is a DSYNT tree structure.
19. The computer readable media of claim 17, wherein the paraphrasing rules are selected from the group consisting of ordering of sentence components, main clause versus a relative clause, different syntactic categories, change in grammatical features, omission of an empty head, transformation of one part of speech to another, and semantically related words.
20. The computer readable media of claim 16, wherein the temporal processing includes:
time stamping phrases based on a first occurrence of the phrase in the collection;
substituting date certain references for ambiguous temporal references;
ordering the phrases based on the time stamp; and
inserting a temporal marker if a temporal gap between phrases exceeds a threshold value.
21. The computer readable media of claim 16, further comprising a phrase divergence processing operation.
22. The computer readable media of claim 16, wherein the sentence generation includes mapping phrases to an input format of a language generation engine and operating the language generation engine.
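
The temporal processing recited in claims 5, 12 and 20 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each phrase occurrence is assumed to arrive with the publication date of its source document, a bare weekday name is the only kind of ambiguous temporal reference handled, and the bracketed marker format and the two-day default threshold are invented for the example. All function names are hypothetical.

```python
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def time_stamp(occurrences):
    """Map each phrase to the earliest document date on which it occurs."""
    stamps = {}
    for text, doc_date in occurrences:
        if text not in stamps or doc_date < stamps[text]:
            stamps[text] = doc_date
    return stamps


def resolve_weekday(text, doc_date):
    """Replace a weekday name with the latest matching date on or before doc_date."""
    for index, name in enumerate(WEEKDAYS):
        if name in text:
            days_back = (doc_date.weekday() - index) % 7
            resolved = doc_date - timedelta(days=days_back)
            return text.replace(name, resolved.isoformat())
    return text


def temporal_processing(occurrences, gap_threshold=timedelta(days=2)):
    """Stamp, disambiguate, and order phrases; mark gaps above the threshold."""
    ordered = sorted(time_stamp(occurrences).items(), key=lambda item: item[1])
    summary, previous = [], None
    for text, stamp in ordered:
        if previous is not None and stamp - previous > gap_threshold:
            # Assumed marker format; the claims do not specify one.
            summary.append(f"[{(stamp - previous).days} days later]")
        summary.append(resolve_weekday(text, stamp))
        previous = stamp
    return summary


occurrences = [
    ("A ceasefire was signed", date(1999, 2, 24)),
    ("Talks resumed on Monday", date(1999, 2, 18)),
    ("Talks resumed on Monday", date(1999, 2, 17)),
]
print(temporal_processing(occurrences))
# ['Talks resumed on 1999-02-15', '[7 days later]', 'A ceasefire was signed']
```

The repeated phrase is stamped with its first occurrence (17 February 1999, a Wednesday), so "Monday" resolves to 15 February 1999, and the seven-day gap before the second phrase exceeds the threshold and produces a marker.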

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US12065999P 1999-02-19 1999-02-19
US60/120,659 1999-02-19
PCT/US2000/004118 WO2000049517A2 (en) 1999-02-19 2000-02-18 Multi-document summarization system and method

Publications (2)

Publication Number Publication Date
CA2363017A1 (en) 2000-08-24
CA2363017C (en) 2011-04-19

Family

ID=22391735

Family Applications (1)

Application Number Title Priority Date Filing Date
CA2363017A Expired - Fee Related CA2363017C (en) 1999-02-19 2000-02-18 Multi-document summarization system and method

Country Status (6)

Country Link
EP (1) EP1190343A4 (en)
AU (1) AU775978B2 (en)
CA (1) CA2363017C (en)
HK (1) HK1045391A1 (en)
IL (2) IL144951A0 (en)
WO (1) WO2000049517A2 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7027974B1 (en) * 2000-10-27 2006-04-11 Science Applications International Corporation Ontology-based parser for natural language processing
US6766316B2 (en) 2001-01-18 2004-07-20 Science Applications International Corporation Method and system of ranking and clustering for document indexing and retrieval
US7818117B2 (en) * 2007-06-20 2010-10-19 Amadeus S.A.S. System and method for integrating and displaying travel advices gathered from a plurality of reliable sources
US11374888B2 (en) 2015-09-25 2022-06-28 Microsoft Technology Licensing, Llc User-defined notification templates

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4965763A (en) * 1987-03-03 1990-10-23 International Business Machines Corporation Computer method for automatic extraction of commonly specified information from business correspondence
JP2783558B2 (en) * 1988-09-30 1998-08-06 株式会社東芝 Summary generation method and summary generation device
JPH0418673A (en) * 1990-05-11 1992-01-22 Hitachi Ltd Method and device for extracting text information
US5638543A (en) * 1993-06-03 1997-06-10 Xerox Corporation Method and apparatus for automatic document summarization
US5384703A (en) * 1993-07-02 1995-01-24 Xerox Corporation Method and apparatus for summarizing documents according to theme
US5689716A (en) * 1995-04-14 1997-11-18 Xerox Corporation Automatic method of generating thematic summaries
US5778397A (en) * 1995-06-28 1998-07-07 Xerox Corporation Automatic method of generating feature probabilities for automatic extracting summarization
US5838323A (en) * 1995-09-29 1998-11-17 Apple Computer, Inc. Document summary computer system user interface
US5848191A (en) * 1995-12-14 1998-12-08 Xerox Corporation Automatic method of generating thematic summaries from a document image without performing character recognition
US5924108A (en) * 1996-03-29 1999-07-13 Microsoft Corporation Document summarizer for word processors

Also Published As

Publication number Publication date
EP1190343A4 (en) 2006-08-09
HK1045391A1 (en) 2002-11-22
IL144951A0 (en) 2002-06-30
AU775978B2 (en) 2004-08-19
CA2363017C (en) 2011-04-19
WO2000049517A3 (en) 2000-11-30
WO2000049517A2 (en) 2000-08-24
EP1190343A2 (en) 2002-03-27
IL144951A (en) 2006-08-01
AU4002600A (en) 2000-09-04

Similar Documents

Publication Publication Date Title
AU2004218705B2 (en) System for identifying paraphrases using machine translation techniques
JP3906356B2 (en) Syntax analysis method and apparatus
US8185377B2 (en) Diagnostic evaluation of machine translators
US20060095250A1 (en) Parser for natural language processing
Fantinuoli et al. Creating and using multilingual corpora in translation studies
JP2005535007A (en) Synthesizing method of self-learning system for knowledge extraction for document retrieval system
WO2001096980A2 (en) Method and system for text analysis
JPH083815B2 (en) Natural language co-occurrence relation dictionary maintenance method
US7366711B1 (en) Multi-document summarization system and method
Sundblad Automatic acquisition of hyponyms and meronyms from question corpora
Chatterjee et al. DEPSYM: A Lightweight Syntactic Text Simplification Approach using Dependency Trees.
CA2363017A1 (en) Multi-document summarization system and method
Broda et al. Recognition of structured collocations in an inflective language
JP2003167898A (en) Information retrieving system
Mille et al. Making Text Resources Accessible to the Reader: the Case of Patent Claims.
WO1997048058A1 (en) Automated translation of annotated text
WO1997048058A9 (en) Automated translation of annotated text
Jacquemin A derivational rephrasing experiment for question answering
Eineborg et al. ILP in part-of-speech tagging—an overview
Tsutsumi A prototype English-Japanese machine translation system for translating IBM computer manuals
Poibeau et al. Generating extraction patterns from a large semantic network and an untagged corpus
Sarmento et al. Making RAPOSA (FOX) Smarter.
Rennes Improved Automatic Text Simplification by Manual Training
Yarmohammadi et al. SBUQA question answering system
Georgantopoulos et al. Term-based Identification of Sentences for Text Summarisation.

Legal Events

Date Code Title Description
EEER Examination request
MKLA Lapsed

Effective date: 2015-02-18