CA2363017A1

CA2363017A1 - Multi-document summarization system and method

Publication number: CA2363017A1
Application number: CA002363017A
Authority: CA
Inventors: Kathleen R. McKeown; Regina Barzilay
Original assignee: Individual
Current assignee: Columbia University in the City of New York
Priority date: 1999-02-19
Filing date: 2000-02-18
Publication date: 2000-08-24
Anticipated expiration: 2020-02-18
Also published as: EP1190343A4; HK1045391A1; IL144951A0; AU775978B2; CA2363017C; WO2000049517A3; WO2000049517A2; EP1190343A2; IL144951A; AU4002600A

Abstract

A summary for a collection of related documents can be generated by extracting phrases (100) from the documents which include common focus elements. Phrase intersection analysis (120) is then performed on the extracted phrases (100) to generate a phrase intersection table, where identical or equivalent phrases are identified. Temporal processing (140) on the phrases in the phrase intersection table is performed to remove ambiguous time references and to sort the phrases in a temporal sequence. Sentence generation is then used to combine the phrases in the phrase intersection table into a coherent summary.

Claims

1. A method for generating a summary of a plurality of related documents in a collection comprising:
extracting phrases having focus elements from the plurality of documents;
performing phrase intersection analysis on the extracted phrases to generate a phrase intersection table;
performing temporal processing on the phrases in the phrase intersection table; and
performing sentence generation using the phrases in the phrase intersection table.
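
Illustrative sketch (not part of the claims): the four steps of claim 1 can be read as a pipeline in which each stage consumes the output of the previous one. The Python below is a minimal sketch under that reading. The function names, the `(doc_id, text)` phrase representation, and the toy stand-in stages are all assumptions made for illustration; the claim does not prescribe any of them.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed representation: a phrase is (index of source document, phrase text).
Phrase = Tuple[int, str]


def summarize(
    documents: Sequence[str],
    extract: Callable[[Sequence[str]], List[Phrase]],
    intersect: Callable[[List[Phrase]], List[str]],
    order_in_time: Callable[[List[str]], List[str]],
    generate: Callable[[List[str]], str],
) -> str:
    """Chain the four claimed steps; each stage is supplied by the caller."""
    phrases = extract(documents)      # phrases having focus elements
    table = intersect(phrases)        # phrase intersection table
    table = order_in_time(table)      # temporal processing
    return generate(table)            # sentence generation


# Toy stand-ins, only to show the data flow between stages.
def toy_extract(documents: Sequence[str], focus: str = "plane") -> List[Phrase]:
    """Keep sentences that mention the (hypothetical) focus word."""
    found = []
    for doc_id, doc in enumerate(documents):
        for sentence in doc.split("."):
            sentence = sentence.strip()
            if focus in sentence.lower():
                found.append((doc_id, sentence))
    return found


def toy_intersect(phrases: List[Phrase]) -> List[str]:
    """Keep phrases whose lowercased text occurs in at least two documents."""
    sources = {}
    for doc_id, text in phrases:
        sources.setdefault(text.lower(), set()).add(doc_id)
    return [text for text, docs in sources.items() if len(docs) >= 2]


def toy_generate(table: List[str]) -> str:
    """Emit one sentence per table entry."""
    return " ".join(text.capitalize() + "." for text in table)
```

With two short documents that both contain "The plane crashed", calling `summarize(docs, toy_extract, toy_intersect, lambda t: t, toy_generate)` keeps only that shared phrase. The identity function stands in for temporal processing here.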

2. The method of generating a summary as defined by claim 1, wherein the phrase intersection analysis comprises:
representing the phrases in tree structures having root nodes and children nodes;
selecting those tree structures with verb root nodes;
comparing the selected root nodes to the other root nodes to identify identical nodes;
applying paraphrasing rules to non-identical root nodes to determine if non-identical nodes are equivalent; and
evaluating the children nodes of those tree structures where the parent nodes are identical or equivalent.
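
Illustrative sketch (not part of the claims): one way to read claim 2 is as a recursive comparison of phrase trees. The Python below assumes a simple `Node` type with a lemma and a part of speech, and it substitutes a small hand-written synonym table for the paraphrasing rules. Both are hypothetical simplifications; the claim does not specify the tree encoding or how the rules are stored.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    lemma: str
    pos: str                              # e.g. "verb", "noun"
    children: List["Node"] = field(default_factory=list)


# Hypothetical stand-in for paraphrasing rules: groups of interchangeable lemmas.
SYNONYMS = [{"kill", "slay"}, {"attack", "assault"}]


def equivalent(a: Node, b: Node) -> bool:
    """Nodes match if identical, or if a (toy) paraphrasing rule relates them."""
    if a.lemma == b.lemma:
        return True
    return any(a.lemma in group and b.lemma in group for group in SYNONYMS)


def intersect(a: Node, b: Node) -> Optional[Node]:
    """Return the subtree shared by a and b, or None if their roots differ."""
    if not equivalent(a, b):
        return None
    shared = []
    unmatched = list(b.children)
    for child in a.children:
        for other in unmatched:
            common = intersect(child, other)
            if common is not None:
                shared.append(common)
                unmatched.remove(other)   # each child of b is matched at most once
                break
    return Node(a.lemma, a.pos, shared)


def phrase_intersections(trees: List[Node]) -> List[Node]:
    """Compare every pair of verb-rooted trees and keep non-trivial overlaps."""
    verb_trees = [t for t in trees if t.pos == "verb"]
    results = []
    for i, a in enumerate(verb_trees):
        for b in verb_trees[i + 1:]:
            common = intersect(a, b)
            if common is not None and common.children:
                results.append(common)
    return results
```

In this sketch a tree rooted at "kill" and a tree rooted at "slay" are treated as equivalent, and only the children they share survive into the result. Trees whose root is not a verb are skipped before any comparison takes place.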

3. The method of claim 2, wherein the tree structure is a DSYNT tree structure.

4. The method of claim 2, wherein the paraphrasing rules are selected from the group consisting of ordering of sentence components, main clause versus a relative clause, different syntactic categories, change in grammatical features, omission of an empty head, transformation of one part of speech to another, and semantically related words.

5. The method of claim 1, wherein the temporal processing includes:
time stamping phrases based on a first occurrence of the phrase in the collection;
substituting date certain references for ambiguous temporal references;
ordering the phrases based on the time stamp; and
inserting a temporal marker if a temporal gap between phrases exceeds a threshold value.
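
Illustrative sketch (not part of the claims): the temporal processing of claim 5 can be approximated with standard date arithmetic. The Python below assumes each phrase occurrence arrives with the publication date of its source document, resolves only the words "today" and "yesterday", and uses a two-day default threshold. The marker text, the threshold, and all function names are hypothetical choices made for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple


def resolve_relative_dates(text: str, published: date) -> str:
    """Replace 'today' / 'yesterday' with an explicit date (toy rule set)."""
    absolute = {"today": published, "yesterday": published - timedelta(days=1)}
    out = []
    for word in text.split():
        core = word.rstrip(".,;")
        tail = word[len(core):]           # keep trailing punctuation
        if core.lower() in absolute:
            out.append("on " + absolute[core.lower()].isoformat() + tail)
        else:
            out.append(word)
    return " ".join(out)


def temporal_process(
    occurrences: List[Tuple[str, date]],
    gap: timedelta = timedelta(days=2),
) -> List[str]:
    """Stamp, disambiguate, order, and mark large gaps between phrases."""
    # Time stamp: earliest publication date at which each phrase occurs.
    first_seen: Dict[str, date] = {}
    for text, published in occurrences:
        if text not in first_seen or published < first_seen[text]:
            first_seen[text] = published

    # Order phrases by time stamp.
    ordered = sorted(first_seen.items(), key=lambda item: item[1])

    output = []
    previous = None
    for text, stamp in ordered:
        # Insert a marker when the gap to the previous phrase exceeds the threshold.
        if previous is not None and stamp - previous > gap:
            output.append(f"[{(stamp - previous).days} days later]")
        output.append(resolve_relative_dates(text, stamp))
        previous = stamp
    return output
```

Resolving relative dates against the first-occurrence time stamp, rather than against each later document, keeps a phrase such as "attacked yesterday" tied to a single calendar date even when several documents repeat it.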

6. The method of claim 1, further comprising a phrase divergence processing operation.

7. The method of claim 1, wherein the sentence generation includes mapping phrases to an input format of a language generation engine and operating the language generation engine.
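
Illustrative sketch (not part of the claims): claim 7 describes mapping phrases to the input format of a language generation engine and then running the engine. The Python below is a minimal sketch assuming a hypothetical engine that accepts a nested dictionary of features. The feature names (`process`, `participants`, `agent`, `affected`) and the toy realizer are invented for this example and do not describe any particular engine.

```python
from typing import Dict


def to_engine_input(verb: str, subject: str, obj: str, tense: str = "past") -> Dict:
    """Map a verb/subject/object phrase onto a hypothetical feature structure."""
    return {
        "cat": "clause",
        "process": {"lex": verb, "tense": tense},
        "participants": {
            "agent": {"lex": subject},
            "affected": {"lex": obj},
        },
    }


def toy_realize(spec: Dict) -> str:
    """Stand-in for a generation engine: linearize agent, verb, affected."""
    verb = spec["process"]["lex"]
    if spec["process"]["tense"] == "past":
        # Naive regular past tense; a real engine would handle morphology.
        verb = verb + "d" if verb.endswith("e") else verb + "ed"
    agent = spec["participants"]["agent"]["lex"]
    affected = spec["participants"]["affected"]["lex"]
    sentence = f"{agent} {verb} {affected}."
    return sentence[0].upper() + sentence[1:]
```

Under these assumptions, `toy_realize(to_engine_input("attack", "rebels", "the village"))` yields "Rebels attacked the village." Keeping the mapping separate from realization means a different engine can be substituted by changing only the mapping function.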

8. A system for generating a summary of a plurality of related documents in a collection comprising:
a storage device for storing the documents in the collection;
a lexical database; and
a processing subsystem, the processing subsystem being operatively coupled to the storage device and the lexical database, the processing subsystem being programmed to access the documents in the storage device and:
using the lexical database to extract phrases having focus elements from the plurality of documents;
performing phrase intersection analysis on the extracted phrases to generate a phrase intersection table;
performing temporal processing on the phrases in the phrase intersection table; and
performing sentence generation using the phrases in the phrase intersection table.

9. The system for generating a summary as defined by claim 8, wherein the phrase intersection analysis processing further comprises:
representing the phrases as data structures having root nodes and children nodes;
selecting those data structures with verb root nodes;
comparing the selected root nodes to the other root nodes to identify identical nodes;
applying paraphrasing rules to non-identical root nodes to determine if non-identical nodes are equivalent; and
evaluating the children nodes of those tree structures where the parent nodes are identical or equivalent.

10. The system of claim 9, wherein the data structure is a DSYNT tree structure.

11. The system of claim 9, wherein the paraphrasing rules are selected from the group consisting of ordering of sentence components, main clause versus a relative clause, different syntactic categories, change in grammatical features, omission of an empty head, transformation of one part of speech to another, and semantically related words.

12. The system of claim 8, wherein the temporal processing includes:
time stamping phrases based on a first occurrence of the phrase in the collection;
substituting date certain references for ambiguous temporal references;
ordering the phrases based on the time stamp; and
inserting a temporal marker if a temporal gap between phrases exceeds a threshold value.

13. The system of claim 8, further comprising a phrase divergence processing operation.

14. The system of claim 8, wherein the processing subsystem includes a language generation engine and wherein sentence generation includes mapping phrases to an input format of the language generation engine and then operating the language generation engine.

15. The system of claim 8, wherein the storage device for storing the documents in the collection is remotely located from the processing subsystem.

16. A computer readable media for programming a computer system to perform a method of generating a summary of a plurality of related documents in a collection comprising:
extracting phrases having focus elements from the plurality of documents;
performing phrase intersection analysis on the extracted phrases to generate a phrase intersection table;
performing temporal processing on the phrases in the phrase intersection table; and
performing sentence generation using the phrases in the phrase intersection table.

17. The computer readable media of claim 16, wherein the phrase intersection analysis comprises:
representing the phrases in tree structures having root nodes and children nodes;
selecting those tree structures with verb root nodes;
comparing the selected root nodes to the other root nodes to identify identical nodes;
applying paraphrasing rules to non-identical root nodes to determine if non-identical nodes are equivalent; and
evaluating the children nodes of those tree structures where the parent nodes are identical or equivalent.

18. The computer readable media of claim 17, wherein the tree structure is a DSYNT tree structure.

19. The computer readable media of claim 17, wherein the paraphrasing rules are selected from the group consisting of ordering of sentence components, main clause versus a relative clause, different syntactic categories, change in grammatical features, omission of an empty head, transformation of one part of speech to another, and semantically related words.

20. The computer readable media of claim 16, wherein the temporal processing includes:
time stamping phrases based on a first occurrence of the phrase in the collection;
substituting date certain references for ambiguous temporal references;
ordering the phrases based on the time stamp; and
inserting a temporal marker if a temporal gap between phrases exceeds a threshold value.

21. The computer readable media of claim 16, further comprising a phrase divergence processing operation.

22. The computer readable media of claim 16, wherein the sentence generation includes mapping phrases to an input format of a language generation engine and operating the language generation engine.