CA2363017A1 - Multi-document summarization system and method - Google Patents
Multi-document summarization system and method
- Publication number
- CA2363017A1
- Authority
- CA
- Canada
- Prior art keywords
- phrases
- nodes
- phrase
- temporal
- root nodes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06F40/35—Discourse or dialogue representation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
A summary for a collection of related documents can be generated by extracting phrases (100) from the documents which include common focus elements. Phrase intersection analysis (120) is then performed on the extracted phrases (100) to generate a phrase intersection table, where identical or equivalent phrases are identified. Temporal processing (140) on the phrases in the phrase intersection table is performed to remove ambiguous time references and to sort the phrases in a temporal sequence. Sentence generation is then used to combine the phrases in the phrase intersection table into a coherent summary.
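The four stages named in the abstract can be sketched end to end as follows. This is a minimal illustration, not the patented implementation: the `Phrase` record, the sentence-level extraction, the exact-match grouping, and every function name are assumptions made for the example, and a real system would use parsed phrases and a language generation engine instead of string joins.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Phrase:
    text: str
    doc_id: str
    published: date


def extract_phrases(docs, focus_terms):
    """Keep sentences that mention at least one focus term (toy extraction).

    docs maps doc_id -> (publication_date, body); focus_terms is a set of
    lowercase words. Both shapes are assumptions for this sketch.
    """
    phrases = []
    for doc_id, (published, body) in docs.items():
        for sentence in body.split("."):
            sentence = sentence.strip()
            words = set(sentence.lower().split())
            if sentence and words & focus_terms:
                phrases.append(Phrase(sentence, doc_id, published))
    return phrases


def build_intersection_table(phrases):
    """Group phrases by normalized text; keep those found in 2+ documents."""
    groups = defaultdict(list)
    for p in phrases:
        groups[p.text.lower()].append(p)
    return {k: v for k, v in groups.items() if len({p.doc_id for p in v}) >= 2}


def order_by_first_report(table):
    """Sort table entries by the earliest publication date in each group."""
    return sorted(table.values(), key=lambda group: min(p.published for p in group))


def generate_summary(ordered_groups):
    """Emit one sentence per group (a stand-in for a real generation engine)."""
    return " ".join(group[0].text + "." for group in ordered_groups)


def summarize(docs, focus_terms):
    """Run extraction, intersection, temporal ordering, and generation."""
    phrases = extract_phrases(docs, focus_terms)
    table = build_intersection_table(phrases)
    return generate_summary(order_by_first_report(table))
```

Exact string matching stands in here for the identical-or-equivalent test; the claims describe a tree-based comparison with paraphrasing rules for that step.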
Claims (22)
1. A method for generating a summary of a plurality of related documents in a collection comprising:
extracting phrases having focus elements from the plurality of documents;
performing phrase intersection analysis on the extracted phrases to generate a phrase intersection table;
performing temporal processing on the phrases in the phrase intersection table; and
performing sentence generation using the phrases in the phrase intersection table.
2. The method of generating a summary as defined by claim 1, wherein the phrase intersection analysis comprises:
representing the phrases in tree structures having root nodes and children nodes;
selecting those tree structures with verb root nodes;
comparing the selected root nodes to the other root nodes to identify identical nodes;
applying paraphrasing rules to non-identical root nodes to determine if non-identical nodes are equivalent; and
evaluating the children nodes of those tree structures where the parent nodes are identical or equivalent.
3. The method of claim 2, wherein the tree structure is a DSYNT tree structure.
4. The method of claim 2, wherein the paraphrasing rules are selected from the group consisting of ordering of sentence components, main clause versus a relative clause, different syntactic categories, change in grammatical features, omission of an empty head, transformation of one part of speech to another, and semantically related words.
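The phrase intersection steps of claims 2 to 4 can be sketched as below. This is a simplified illustration under stated assumptions: `Node` is a hypothetical stand-in for a DSYNT node, the `RELATED` table is an invented substitute for a lexical database, and only one of the listed paraphrasing rules (semantically related words) is modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lemma: str
    pos: str  # coarse part of speech, e.g. "verb" or "noun"
    children: list = field(default_factory=list)


# Hypothetical stand-in for a lexical resource: sets of lemmas treated as
# semantically related (one of the paraphrasing rules listed in claim 4).
RELATED = [{"crash", "collide"}, {"say", "state"}]


def roots_equivalent(a, b):
    """Roots match if identical, or if the 'related words' rule fires."""
    if a.lemma == b.lemma:
        return True
    return any(a.lemma in group and b.lemma in group for group in RELATED)


def shared_children(a, b):
    """Return child lemmas the two trees have in common, in a's order."""
    b_lemmas = {child.lemma for child in b.children}
    return [child.lemma for child in a.children if child.lemma in b_lemmas]


def intersect(trees):
    """Pairwise compare verb-rooted trees; record matches as table rows."""
    verb_trees = [t for t in trees if t.pos == "verb"]
    table = []
    for i, a in enumerate(verb_trees):
        for b in verb_trees[i + 1:]:
            if roots_equivalent(a, b):
                common = shared_children(a, b)
                if common:
                    table.append((a.lemma, b.lemma, common))
    return table
```

Each row of the returned list pairs two root lemmas with the children they share, which is one plausible shape for a phrase intersection table; the claims do not fix its layout.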
5. The method of claim 1, wherein the temporal processing includes:
time stamping phrases based on a first occurrence of the phrase in the collection;
substituting date certain references for ambiguous temporal references;
ordering the phrases based on the time stamp; and
inserting a temporal marker if a temporal gap between phrases exceeds a threshold value.
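The temporal processing steps of claim 5 can be sketched as follows. The `RELATIVE` offset table, the two-day default threshold, and the wording of the inserted marker are all assumptions chosen for illustration; the claims specify none of them. The word matching is deliberately naive and ignores punctuation attached to a word.

```python
from datetime import date, timedelta

# Hypothetical offsets for relative expressions, measured from the
# publication date of the article in which the phrase first appeared.
RELATIVE = {"today": 0, "yesterday": -1}


def resolve_dates(text, published):
    """Replace relative words with an explicit date, e.g. 'on 1999-02-18'."""
    words = []
    for word in text.split():
        offset = RELATIVE.get(word.lower())
        if offset is None:
            words.append(word)
        else:
            words.append("on " + (published + timedelta(days=offset)).isoformat())
    return " ".join(words)


def temporal_process(occurrences, gap=timedelta(days=2)):
    """occurrences: iterable of (phrase_key, text, publication_date)."""
    first_seen = {}
    for key, text, published in occurrences:
        # Time stamp = earliest publication date at which the phrase appears.
        if key not in first_seen or published < first_seen[key][0]:
            first_seen[key] = (published, text)
    ordered = sorted(first_seen.values(), key=lambda item: item[0])
    output, previous = [], None
    for stamp, text in ordered:
        resolved = resolve_dates(text, stamp)
        if previous is not None and stamp - previous > gap:
            # Temporal marker; its wording is an assumption of this sketch.
            resolved = f"{(stamp - previous).days} days later, {resolved}"
        output.append(resolved)
        previous = stamp
    return output
```

Resolving relative expressions against the first-occurrence date keeps the substituted date consistent with the time stamp used for ordering.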
6. The method of claim 1, further comprising a phrase divergence processing operation.
7. The method of claim 1, wherein the sentence generation includes mapping phrases to an input format of a language generation engine and operating the language generation engine.
8. A system for generating a summary of a plurality of related documents in a collection comprising:
a storage device for storing the documents in the collection;
a lexical database; and
a processing subsystem, the processing subsystem being operatively coupled to the storage device and the lexical database, the processing subsystem being programmed to access the documents in the storage device and:
using the lexical database to extract phrases having focus elements from the plurality of documents;
performing phrase intersection analysis on the extracted phrases to generate a phrase intersection table;
performing temporal processing on the phrases in the phrase intersection table; and
performing sentence generation using the phrases in the phrase intersection table.
9. The system for generating a summary as defined by claim 8, wherein the phrase intersection analysis processing further comprises:
representing the phrases as data structures having root nodes and children nodes;
selecting those data structures with verb root nodes;
comparing the selected root nodes to the other root nodes to identify identical nodes;
applying paraphrasing rules to non-identical root nodes to determine if non-identical nodes are equivalent; and
evaluating the children nodes of those tree structures where the parent nodes are identical or equivalent.
10. The system of claim 9, wherein the data structure is a DSYNT tree structure.
11. The system of claim 9, wherein the paraphrasing rules are selected from the group consisting of ordering of sentence components, main clause versus a relative clause, different syntactic categories, change in grammatical features, omission of an empty head, transformation of one part of speech to another, and semantically related words.
12. The system of claim 8, wherein the temporal processing includes:
time stamping phrases based on a first occurrence of the phrase in the collection;
substituting date certain references for ambiguous temporal references;
ordering the phrases based on the time stamp; and
inserting a temporal marker if a temporal gap between phrases exceeds a threshold value.
13. The system of claim 8, further comprising a phrase divergence processing operation.
14. The system of claim 8, wherein the processing subsystem includes a language generation engine and wherein sentence generation includes mapping phrases to an input format of the language generation engine and then operating the language generation engine.
15. The system of claim 8, wherein the storage device for storing the documents in the collection is remotely located from the processing subsystem.
16. A computer readable media for programming a computer system to perform a method of generating a summary of a plurality of related documents in a collection comprising:
extracting phrases having focus elements from the plurality of documents;
performing phrase intersection analysis on the extracted phrases to generate a phrase intersection table;
performing temporal processing on the phrases in the phrase intersection table; and
performing sentence generation using the phrases in the phrase intersection table.
17. The computer readable media of claim 16, wherein the phrase intersection analysis comprises:
representing the phrases in tree structures having root nodes and children nodes;
selecting those tree structures with verb root nodes;
comparing the selected root nodes to the other root nodes to identify identical nodes;
applying paraphrasing rules to non-identical root nodes to determine if non-identical nodes are equivalent; and
evaluating the children nodes of those tree structures where the parent nodes are identical or equivalent.
18. The computer readable media of claim 17, wherein the tree structure is a DSYNT tree structure.
19. The computer readable media of claim 17, wherein the paraphrasing rules are selected from the group consisting of ordering of sentence components, main clause versus a relative clause, different syntactic categories, change in grammatical features, omission of an empty head, transformation of one part of speech to another, and semantically related words.
20. The computer readable media of claim 16, wherein the temporal processing includes:
time stamping phrases based on a first occurrence of the phrase in the collection;
substituting date certain references for ambiguous temporal references;
ordering the phrases based on the time stamp; and
inserting a temporal marker if a temporal gap between phrases exceeds a threshold value.
21. The computer readable media of claim 16, further comprising a phrase divergence processing operation.
22. The computer readable media of claim 16, wherein the sentence generation includes mapping phrases to an input format of a language generation engine and operating the language generation engine.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12065999P | 1999-02-19 | 1999-02-19 | |
US60/120,659 | 1999-02-19 | | |
PCT/US2000/004118 WO2000049517A2 (en) | 1999-02-19 | 2000-02-18 | Multi-document summarization system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CA2363017A1 (en) | 2000-08-24 |
CA2363017C (en) | 2011-04-19 |
Family
ID=22391735
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA2363017A Expired - Fee Related CA2363017C (en) | 1999-02-19 | 2000-02-18 | Multi-document summarization system and method |
Country Status (6)
Country | Link |
---|---|
EP (1) | EP1190343A4 (en) |
AU (1) | AU775978B2 (en) |
CA (1) | CA2363017C (en) |
HK (1) | HK1045391A1 (en) |
IL (2) | IL144951A0 (en) |
WO (1) | WO2000049517A2 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7027974B1 (en) * | 2000-10-27 | 2006-04-11 | Science Applications International Corporation | Ontology-based parser for natural language processing |
US6766316B2 (en) | 2001-01-18 | 2004-07-20 | Science Applications International Corporation | Method and system of ranking and clustering for document indexing and retrieval |
US7818117B2 (en) * | 2007-06-20 | 2010-10-19 | Amadeus S.A.S. | System and method for integrating and displaying travel advices gathered from a plurality of reliable sources |
US11374888B2 (en) | 2015-09-25 | 2022-06-28 | Microsoft Technology Licensing, Llc | User-defined notification templates |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4965763A (en) * | 1987-03-03 | 1990-10-23 | International Business Machines Corporation | Computer method for automatic extraction of commonly specified information from business correspondence |
JP2783558B2 (en) * | 1988-09-30 | 1998-08-06 | 株式会社東芝 | Summary generation method and summary generation device |
JPH0418673A (en) * | 1990-05-11 | 1992-01-22 | Hitachi Ltd | Method and device for extracting text information |
US5638543A (en) * | 1993-06-03 | 1997-06-10 | Xerox Corporation | Method and apparatus for automatic document summarization |
US5384703A (en) * | 1993-07-02 | 1995-01-24 | Xerox Corporation | Method and apparatus for summarizing documents according to theme |
US5689716A (en) * | 1995-04-14 | 1997-11-18 | Xerox Corporation | Automatic method of generating thematic summaries |
US5778397A (en) * | 1995-06-28 | 1998-07-07 | Xerox Corporation | Automatic method of generating feature probabilities for automatic extracting summarization |
US5838323A (en) * | 1995-09-29 | 1998-11-17 | Apple Computer, Inc. | Document summary computer system user interface |
US5848191A (en) * | 1995-12-14 | 1998-12-08 | Xerox Corporation | Automatic method of generating thematic summaries from a document image without performing character recognition |
US5924108A (en) * | 1996-03-29 | 1999-07-13 | Microsoft Corporation | Document summarizer for word processors |
2000
- 2000-02-18: CA, application CA2363017A, patent CA2363017C; not active (Expired - Fee Related)
- 2000-02-18: IL, application IL14495100A, patent IL144951A0; active (IP Right Grant)
- 2000-02-18: WO, application PCT/US2000/004118, patent WO2000049517A2; active (Application Filing)
- 2000-02-18: AU, application AU40026/00A, patent AU775978B2; not active (Ceased)
- 2000-02-18: EP, application EP00919318A, patent EP1190343A4; not active (Ceased)
2001
- 2001-08-16: IL, application IL144951A, patent IL144951A; not active (IP Right Cessation)
2002
- 2002-09-25: HK, application HK02106992.3A, patent HK1045391A1; status unknown
Also Published As
Publication number | Publication date |
---|---|
EP1190343A4 (en) | 2006-08-09 |
HK1045391A1 (en) | 2002-11-22 |
IL144951A0 (en) | 2002-06-30 |
AU775978B2 (en) | 2004-08-19 |
CA2363017C (en) | 2011-04-19 |
WO2000049517A3 (en) | 2000-11-30 |
WO2000049517A2 (en) | 2000-08-24 |
EP1190343A2 (en) | 2002-03-27 |
IL144951A (en) | 2006-08-01 |
AU4002600A (en) | 2000-09-04 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
AU2004218705B2 (en) | System for identifying paraphrases using machine translation techniques | |
JP3906356B2 (en) | Syntax analysis method and apparatus | |
US8185377B2 (en) | Diagnostic evaluation of machine translators | |
US20060095250A1 (en) | Parser for natural language processing | |
Fantinuoli et al. | Creating and using multilingual corpora in translation studies | |
JP2005535007A (en) | Synthesizing method of self-learning system for knowledge extraction for document retrieval system | |
WO2001096980A2 (en) | Method and system for text analysis | |
JPH083815B2 (en) | Natural language co-occurrence relation dictionary maintenance method | |
US7366711B1 (en) | Multi-document summarization system and method | |
Sundblad | Automatic acquisition of hyponyms and meronyms from question corpora | |
Chatterjee et al. | DEPSYM: A Lightweight Syntactic Text Simplification Approach using Dependency Trees. | |
CA2363017A1 (en) | Multi-document summarization system and method | |
Broda et al. | Recognition of structured collocations in an inflective language | |
JP2003167898A (en) | Information retrieving system | |
Mille et al. | Making Text Resources Accessible to the Reader: the Case of Patent Claims. | |
WO1997048058A1 (en) | Automated translation of annotated text | |
WO1997048058A9 (en) | Automated translation of annotated text | |
Jacquemin | A derivational rephrasing experiment for question answering | |
Eineborg et al. | ILP in part-of-speech tagging—an overview | |
Tsutsumi | A prototype English-Japanese machine translation system for translating IBM computer manuals | |
Poibeau et al. | Generating extraction patterns from a large semantic network and an untagged corpus | |
Sarmento et al. | Making RAPOSA (FOX) Smarter. | |
Rennes | Improved Automatic Text Simplification by Manual Training | |
Yarmohammadi et al. | SBUQA question answering system | |
Georgantopoulos et al. | Term-based Identification of Sentences for Text Summarisation. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | EEER | Examination request | |
 | MKLA | Lapsed | Effective date: 20150218 |