WO2001006408A1 - Cut and paste document summarization system and method - Google Patents


Info

Publication number
WO2001006408A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
generating
components
module
corpus
Prior art date
Application number
PCT/US2000/004505
Other languages
French (fr)
Inventor
Kathleen R. Mckeown
Hongyan Jing
Original Assignee
The Trustees Of Columbia University In The City Of New York
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Trustees Of Columbia University In The City Of New York
Priority to EP00915831A (EP1208455A4)
Priority to AU37038/00A (AU778394B2)
Priority to CA002363834A (CA2363834A1)
Priority to IL14495000A (IL144950A0)
Publication of WO2001006408A1
Priority to HK02108117.9A (HK1046570A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06F 40/35 Discourse or dialogue representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A summary of an input document is generated by extracting at least one sentence from the document and parsing the extracted sentences into components, such as in a parse tree (110). Sentence reduction processing is performed to mark components which can be removed from the parse trees (135); sentence reduction can include context importance processing, probabilistic processing, and linguistic knowledge based processing. Sentence combination processing includes identifying sentence combination operations and establishing rules for applying the sentence combination operations to mark the parse trees to merge at least two sentences (140). Sentence combination processing also provides a paste operation that operates on the marked components to effect the indicated removal and combination of sentence components, thereby generating summary sentences for the input document.

Description

CUT AND PASTE DOCUMENT SUMMARIZATION SYSTEM AND METHOD
Statement of Government Rights
The United States Government may have certain rights to the invention set forth herein pursuant to a grant by the National Science Foundation, Contract No. IRI-96-198124.
Statement of Related Applications
This application claims the benefit of United States provisional patent application, Serial No. 60/120,657, entitled "Summary Generation Through Intelligent Cutting and Pasting of the Input Document" which was filed on February 19, 1999.
Field of the Invention
The present invention relates generally to information summarization and more particularly relates to systems and methods for generating a summary of a document using automated cutting and pasting of the input document.
Background of the Invention
The amount of information available today drastically exceeds that of any other time in history. With the continuing expansion of the Internet, this trend will likely continue well into the future. People researching a topic are often faced with information overload, as the number of potentially relevant documents exceeds their ability to individually review each document. To address this problem, researchers often rely on summaries to quickly evaluate whether a document is truly relevant to the problem at hand.
Given the vast collection of documents available, there is interest in developing and improving the systems and methods used to summarize information content. For individual documents, domain-dependent template based systems and domain-independent sentence extraction methods are known. Such known systems can provide a reasonable summary of a single document when the domain is known.
Many presently available summarizers extract sentences from the original documents to produce summaries. However, since the sentences are generally extracted without supporting context information, the resulting summaries can be incoherent, and in some cases, can convey misleading information.
Therefore, there remains a need for systems and methods which can generate a more readable and concise summary of a document.
Summary of the Invention
It is an object of the present invention to provide a system and method for generating a summary of a document.
It is another object of the present invention to provide a summarization system which extracts sentences from an input document and then transforms the extracted sentences such that a concise, coherent and accurate summary results. It is a further object of the present invention to provide a system and method for generating a summary of a document which use automated cutting and pasting of the input document.
A present method for generating a summary of an input document includes extracting at least one sentence from the document. The extracted sentences are parsed into components, preferably in a parse tree representation. Sentence reduction is performed to mark components which can be removed from the extracted sentences. Sentence combination is performed to mark components of two or more sentences which can be merged. Sentence combination also includes a paste operation to operate on the marked components to effect the indicated removal and combination of sentence components.
A preferred sentence reduction operation includes measuring the contextual importance of the components; measuring the probabilistic importance of the components based on a given corpus; measuring the importance of the components based on linguistic knowledge; synthesizing the contextual, probabilistic and knowledge based importance measures into a relative importance score for each component; and marking for removal those components with an importance score below a threshold value.
The contextual importance can be measured by establishing a plurality of lexical links of at least one type among the components in a local context in the document and computing a context importance score according to the type, number and direction of lexical links associated with each component. The types of lexical links can include repetition, inflectional variants, derivational variants, synonyms, hypernyms, antonyms, part-of, entailment, and causative links.
In a preferred method, the sentence combination operation includes identifying sentence combination operations from a sentence combination subcorpus and developing rules regarding the application of the sentence combination operations. The combination rules are then applied to the extracted sentences after sentence reduction to identify and merge suitable sentences from the original article. The sentence combination operations can be selected from the group including add descriptions, aggregations, substitute incoherent phrases, substitute phrases with more general or more specific information, and mixed operations.
A present system for generating a summary of an input document includes an extraction module which receives the input document and extracts at least one sentence related to a focus of the document. A summary sentence generation module is provided, which generally includes a sentence reduction module and a sentence combination module. The system includes a grammatical parser operatively coupled to the generation module for parsing the extracted sentences into components in a grammatical representation. A combined lexicon and a corpus of human generated summaries are operatively coupled to the generation module for use by the operational modules during summary generation.
The corpus can further include a sentence combination subcorpus and a sentence reduction subcorpus. The subcorpora can be generated manually or through the use of a decomposition module.
Preferably, the sentence reduction module is cooperatively engaged with the combined lexicon and performs context importance processing on the components of the grammatical representation. Context importance processing can include establishing a plurality of lexical links of at least one type for the components and generating a context importance score based on the type and number of links associated with the components. The number and type of lexical links can vary; however, a preferred set of lexical link types includes repetition, inflectional variants, derivational variants, synonyms, hypernyms, antonyms, part-of, entailment, and causative links.
Preferably, the sentence reduction module further computes the relative importance of the components based on linguistic knowledge stored in the combined lexicon. The sentence reduction module can also be cooperatively engaged with the corpus and perform probabilistic importance processing on the components of the grammatical representation in accordance with the particular corpus used.
The sentence combination module can be used to identify sentence combination operations from a sentence combination subcorpus and develop rules regarding the application of the sentence combination operations. The combination module applies the combination rules to the extracted sentences after sentence reduction to identify and merge suitable sentences from the original article.
A decomposition module in accordance with the present application can be used to evaluate human-generated summaries and map corresponding portions of the summaries to the original documents. The decomposition module indexes words in the summary and the original document. A Hidden Markov Model is then built based on heuristic rules to determine the probability of phrases in the summary sentence matching a given phrase in the original document. A Viterbi algorithm can then be employed to determine the best solution for the Hidden Markov Model and generate a mapping between summary phrases and the original document. This mapping can be used to generate, among other things, a sentence reduction subcorpus and a sentence combination subcorpus. Such a decomposition module can be operatively coupled to the corpus in the summary generation system described above.
Brief Description of the Drawing
Further objects, features and advantages of the invention will become apparent from the following detailed description taken in conjunction with the accompanying figures showing illustrative embodiments of the invention, in which Figure 1 is a block diagram of the system architecture of the present document summarization system;
Figure 2 is a flow chart illustrating an exemplary embodiment of a sentence reduction operation in accordance with the summarization system of Figure 1; Figure 3 is a pictorial diagram of an exemplary parse tree sentence representation;
Figure 4 is a flow chart illustrating an exemplary embodiment of a sentence combination operation in accordance with the present summarization system of Figure 1; Figure 5 is a table illustrating exemplary sentence combination operations for the sentence combination operation of Figure 4;
Figure 6 is a table illustrating exemplary sentence combination rules for applying the sentence combination operations of Figure 5;
Figure 7 is a flow diagram illustrating the operation of the corpus decomposition module of Figure 1; and
Figure 8 is a pictorial diagram of a Hidden Markov Model for use in a corpus decomposition module.
Throughout the figures, the same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components or portions of the illustrated embodiments. Moreover, while the subject invention will now be described in detail with reference to the figures, it is done so in connection with the illustrative embodiments. It is intended that changes and modifications can be made to the described embodiments without departing from the true scope and spirit of the subject invention as defined by the appended claims.
Detailed Description of Preferred Embodiments
The present summarization systems and methods generate a generic, domain-independent, single-document summary of a received input document. Figure 1 is a block diagram illustrating the system architecture of an exemplary embodiment of the present summarization system. Such a system can be implemented on various computer hardware, software and operating system platforms. The particular system components selected are not critical to the practice of the present invention. For example, the present system of Figure 1 can be implemented on a personal computer system, such as an IBM-compatible system. Referring to Figure 1, an input document 105 in computer-readable form is applied to an extraction module 110, which determines the focus of the document 105 and extracts sentences from the document accordingly. A number of extraction techniques can be used in the extraction module 110. In a preferred embodiment, the extraction module 110 links words in a sentence to other words in the input document 105 through repetitions, morphological relations and lexical relations. An importance score can then be computed for each word in the article 105 based on the number, type and direction (forward, backward) of the lexical links associated with the word. A sentence score can be determined by adding the importance scores for each of the words in the sentence and normalizing the sum based on the number of words in the sentence. The sentences can then be extracted based on the highest relative sentence scores.
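To make the scoring concrete, the following is a minimal Python sketch of the word and sentence scoring just described; it is not the patent's implementation. The link types, weights, and function names are illustrative placeholders, and link detection itself is assumed to be done elsewhere.

```python
# Minimal sketch of extraction scoring: a word earns weight for each lexical
# link it participates in, and a sentence's score is the length-normalized
# sum of its word scores. Link types and weights here are illustrative only.

LINK_WEIGHTS = {"repetition": 1.0, "morphological": 0.9, "lexical": 0.7}

def word_score(links):
    """links: list of (link_type, direction) pairs for one word."""
    return sum(LINK_WEIGHTS.get(link_type, 0.0) for link_type, _direction in links)

def sentence_score(words, links_by_word):
    """Sum of word scores, normalized by the number of words in the sentence."""
    if not words:
        return 0.0
    return sum(word_score(links_by_word.get(w, [])) for w in words) / len(words)

def extract_sentences(sentences, links_by_word, k=3):
    """Return the k highest-scoring sentences, kept in document order."""
    ranked = sorted(range(len(sentences)), reverse=True,
                    key=lambda i: sentence_score(sentences[i], links_by_word))
    return [sentences[i] for i in sorted(ranked[:k])]
```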
The extraction module 110 provides the extracted sentences 115 to a generation module 120. The generation module 120 also receives the original document 105 as an input. The generation module 120 further includes a sentence reduction module 135 and a sentence combination module 140. The sentence reduction module 135 provides a marked-up parse tree as input data for the sentence combination module 140, which generates and outputs the summary sentences. The generation module 120 is operatively coupled to a corpus of human-written summaries 165, a lexical database 170, and a combined reusable lexicon 175. The corpus 165 generally includes a broad collection of human-generated summaries as well as the corresponding original documents. The corpus 165 can also include a sentence reduction subcorpus 165a and a sentence combination subcorpus 165b, which can be generated manually or through a decomposition module. The sentence reduction subcorpus 165a includes entries of sentence pairs linking an original sentence to a human-reduced sentence. The sentence combination subcorpus 165b includes mappings from human-combined sentences to two or more original sentences.
A suitable exemplary corpus 165 was generated using Communications-related Headlines, a free daily online news service provided by the Benton Foundation (http://www.benton.org). The articles from this particular service are communications-related, but the topics involved are very broad, including law, company mergers, new technologies, labor issues and so on. Of course, other sources of document summaries can also be used to generate a suitable corpus. To ensure that the resulting corpus is somewhat generic, the articles from the selected source should not possess a particular writing style. Thus, preferred sources feature articles from multiple sources or articles from various sections of one or more sources. A suitable corpus 165 was generated in four major steps. First, human-written, single-document summaries are received from the source. Second, the original documents are retrieved and correlated to the respective summaries. The retrieved documents are then "cleaned" by removing irrelevant material such as indexes and advertisements. Finally, the quality of the correspondence between the summary and the original document is verified. The cleaning and verification processes are generally performed manually. The sentence reduction subcorpus 165a and sentence combination subcorpus 165b entries were generated by the decomposition module 185, the operation of which is explained below.
The lexical database 170 can take the form of the WordNet database, which is described in the article "WordNet: A Lexical Database for English," by G.A. Miller, Communications of the ACM, Vol. 38, No. 11, pp. 39-41, November 1995. A suitable embodiment of the combined lexicon 175 can be constructed by combining multiple, large-scale resources such as WordNet, the English Verb Classes and Alternations (EVCA) database, the COMLEX syntax dictionary and the Brown Corpus tagged with WordNet senses. The combined lexicon 175 can be formed by encoding the EVCA database with COMLEX-compatible syntax and merging the EVCA into the COMLEX database. This results in each verb in the combined lexicon 175 being marked with a list of subcategorizations and alternate syntactic patterns. Preferably, WordNet is added to the EVCA/COMLEX combination to refine the syntactic information and provide additional lexical information to the lexicon 175.
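For exposition only, a merged entry of the kind described might be pictured as the following toy structure; the field names, frame labels and sense keys are invented and do not reproduce the actual COMLEX or EVCA encodings.

```python
# Toy illustration of a merged lexicon entry: COMLEX-style subcategorization
# frames, an EVCA-style alternation class, and WordNet-style sense keys
# combined under one verb. All field names and values are invented examples.

GIVE_ENTRY = {
    "verb": "give",
    "subcategorizations": [
        "NP-V-NP-NP",       # ditransitive: "give someone something"
        "NP-V-NP-PP(to)",   # prepositional variant: "give something to someone"
    ],
    "alternations": ["dative"],            # EVCA-style alternation class
    "senses": ["give.v.01", "give.v.03"],  # WordNet-style sense keys
}
```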
The generation module 120 is also cooperatively coupled to natural language processing (NLP) tools such as a syntactic parser 180 and a co-reference resolving module 190 which can include anaphora resolution. These tools can be software modules which are called by the generation module 120. A suitable syntactic parser 180 is the English Slot Grammar (ESG) parser available from International Business Machines, Inc. A suitable co-reference resolving module 190 is the Deep Read system, available from Mitre, Inc.
Figure 2 is a flow diagram further illustrating the operation of the sentence reduction module 135. The reduction module 135 receives extracted sentences 115 as input (step 205). The reduction module invokes the parser 180 to grammatically parse the extracted sentences 115 and generate a parse tree representation of the sentences (step 210). In step 215, contextual importance is determined by detecting lexical links among words in a local context and then computing an importance score based on the number, type and direction of lexical links detected. The context processing step 215 generates an importance score for each node in the parse tree indicating the relative importance of the nodes to the focus of the input document 105.
The number, type and direction (forward, backward) of lexical links used in the practice of the present invention may vary. An empirical study has demonstrated that the following nine lexical relation types provide a meaningful representation of contextual importance: (1) repetition, (2) inflectional variants, (3) derivational variants, (4) synonyms, (5) hypernyms, (6) antonyms, (7) part-of, (8) entailment (for example: kill → die), and (9) causative (for example: eat → chew). Inflectional variants (2) and derivational variants (3) can be derived from the CELEX database content, available from the Centre for Lexical Information, Max Planck Institute for Psycholinguistics, Nijmegen, which can be included in the combined lexicon 175. The other lexical relations can be extracted using the separate lexical database 170, such as WordNet. To frame the local context of a word, a number of sentences before and after the current sentence location are evaluated for the presence of lexical links. The number of sentences selected for this operation involves balancing the level of contextual depth against the amount of processing overhead. Using the five sentences before and the five sentences after the current sentence has been found to provide reasonable local context without incurring excessive processing overhead.
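As a rough sketch of this windowed link detection (not the patent's implementation), the following stub scans the five sentences on either side of the current sentence; the relation table is a toy stand-in for the CELEX and WordNet resources named above.

```python
# Sketch of lexical-link detection over a ±5-sentence window. The relation
# table is a toy stand-in; a real system would consult CELEX (inflectional
# and derivational variants) and WordNet (the remaining relations).

SYNONYMS = {"merger": {"consolidation"}, "buy": {"purchase"}}  # toy data

def find_links(sentences, idx, window=5):
    """Yield (word, relation, direction) links between sentence idx and its
    neighbors, where each sentence is a list of lower-cased words."""
    lo = max(0, idx - window)
    hi = min(len(sentences), idx + window + 1)
    for j in range(lo, hi):
        if j == idx:
            continue
        direction = "forward" if j > idx else "backward"
        neighbor = set(sentences[j])
        for w in sentences[idx]:
            if w in neighbor:
                yield (w, "repetition", direction)
            elif SYNONYMS.get(w, set()) & neighbor:
                yield (w, "synonym", direction)
```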
After the lexical links have been identified (step 215a), an importance score for each word in the extracted sentences can be calculated (step 215b). Lexical links from the current sentence to subsequent sentences are referred to as forward links, and those from the current sentence to preceding sentences are referred to as backward links. The importance score, referred to as the context weight, can be computed as follows:
1) $ForwardWeight(w) = \sum_{i=1}^{9} W_i \times Fnum_i(w)$

2) $BackwardWeight(w) = \sum_{i=1}^{9} W_i \times Bnum_i(w)$

3) $TotalWeight(w) = ForwardWeight(w) + BackwardWeight(w)$

4) $Ratio(w) = \dfrac{\max(ForwardWeight(w),\ BackwardWeight(w))}{TotalWeight(w)}$

5) $ContextWeight(w) = Ratio(w) \times TotalWeight(w)$
where $ForwardWeight(w)$ computes the weight of forward links, $BackwardWeight(w)$ computes the weight of backward links, $TotalWeight(w)$ represents the sum of all links, $Ratio(w)$ computes a weight for the location of the word, $W_i$ is the weight assigned to link type $i$, and $Fnum_i(w)$ and $Bnum_i(w)$ are the numbers of forward and backward links of type $i$ associated with word $w$. To compute the weight of the various lexical links, each type of link is assigned a weighted value according to its relative importance. For example, the nine lexical relations set forth above were presented in descending order of importance and accordingly can be assigned linearly decreasing weights such as (1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2). The value of $Ratio(w)$ represents the value assigned based on the location of the word in the original document. For example, when a sentence introduces or ends a topic, it is considered more important, and the components of those sentences will be assigned a relatively higher location value.
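A direct transcription of formulas 1) through 5) might look as follows; the per-type link counts are assumed to come from a link-detection pass such as the sketch above, and the example values are arbitrary.

```python
# Transcription of formulas 1)-5): weighted forward/backward link sums, a
# direction ratio standing in for word location, and the final context
# weight. TYPE_WEIGHTS follows the linearly decreasing example in the text.

TYPE_WEIGHTS = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

def context_weight(fnum, bnum):
    """fnum[i], bnum[i]: counts of forward/backward links of relation type i."""
    forward = sum(w * n for w, n in zip(TYPE_WEIGHTS, fnum))
    backward = sum(w * n for w, n in zip(TYPE_WEIGHTS, bnum))
    total = forward + backward
    if total == 0:
        return 0.0
    ratio = max(forward, backward) / total
    return ratio * total

# Example: two forward repetition links and one backward synonym link.
print(context_weight([2, 0, 0, 0, 0, 0, 0, 0, 0],
                     [0, 0, 0, 1, 0, 0, 0, 0, 0]))  # ≈ 2.0
```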
The use of various types of lexical relations improves the measurement of a word's relatedness to the main topic. Although simple relations like repetition and synonymy can be used to determine a measure of contextual importance, these surface relations are generally unable to detect more subtle connections between words.
Following context processing (step 215), the reduction module 135 can perform interdependency processing using a probability analysis based on the corpus 165 of human-written reduced sentences. Such an analysis can indicate the degree of correlation between components in a sentence, such as the relationship between a verb and its subclause.
The probability computation can be performed based on parse trees, using probabilities to indicate the degree of correlation between a parent node and its child nodes in the parse tree. Figure 3 illustrates an exemplary fragment of a parse tree used to explain the operation of the probability computation. In Figure 3, the main verb "give" is the parent node 300, and it has four child nodes: subclause conjunct 305, subject 310, indirect object 315 and object 320, respectively. The parse tree can also include further levels below the child nodes, such as nodes ndet 325 and adjp 330 below child node obj 320 and nodes lconj 335 and rconj 340 below node adjp 330, respectively. To measure the interdependency between the verb give and its subclause 305, the probability that the subclause is removed when the verb is give can be represented by $PROB(\text{when\_clause is removed} \mid verb = \text{give})$. This conditional probability is transformed using Bayes' rule to:
$$PROB(\text{when\_clause is removed} \mid v = \text{give}) = \frac{PROB(v = \text{give} \mid \text{when\_clause is removed}) \times PROB(\text{when\_clause is removed})}{PROB(v = \text{give})}$$

The probabilities that a clause will be reduced or will remain unchanged can be calculated in a similar manner.
The probability associated with the other child nodes of the current root node is calculated in a similar manner. After the probabilities for each of the first-level child nodes are calculated, each of the child nodes in the current level of the tree is then treated as a parent node and the process is repeated through each descending level of the parse tree until every parent-child node pair has been considered. The probabilities for the corpus 165 can be calculated and stored in a look-up table which is used when the reduction module 135 is run.
The context processing of step 215 and the probability processing of step 220 provide a relative ranking of sentence components. However, this ranking does not necessarily provide a measure of which components must be included to produce a grammatically correct summary sentence. Thus, preferably, after the probability analysis of step 220, reduction processing based on linguistic knowledge is performed (step 225). In this operation, the reduction module 135 works in cooperation with the combined lexicon 175.
The linguistic knowledge processing step 225 operates with the combined lexicon 175 to evaluate the parse tree for each extracted sentence 115 and determine which child nodes are essential to maintain the grammatical correctness of the component represented by the parent node. Linguistic judgments are identified in the parse tree by assigning a binary tag to each node in the parse tree. The value of a tag is either essential or reducible, indicating whether or not a node is indispensable to its parent node. For example, referring to Figure 3, the lexicon 175 will indicate that the verb give needs a subject and two objects. Thus the child nodes subj 310, iobj 315 and obj 320 can be marked as essential. In this case, the child node subclause 305 is then rendered non-essential and will be marked as reducible. The lexicon 175 can also include collocations, such as consist of or replace .... with ...., which prevent removal of indispensable components.
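The give example can be sketched in a few lines; the lexicon fragment below is a hypothetical stand-in for the combined lexicon's subcategorization data, not its actual format.

```python
# Sketch of the binary essential/reducible tagging. REQUIRED_ROLES is a toy
# stand-in for the combined lexicon: "give" requires a subject and two
# objects, so every other child of its node is tagged reducible.

REQUIRED_ROLES = {"give": {"subj", "iobj", "obj"}}

def tag_children(head_word, child_roles):
    """Return {role: 'essential' or 'reducible'} for one parse-tree node."""
    required = REQUIRED_ROLES.get(head_word, set())
    return {role: ("essential" if role in required else "reducible")
            for role in child_roles}

print(tag_children("give", ["subclause", "subj", "iobj", "obj"]))
# {'subclause': 'reducible', 'subj': 'essential',
#  'iobj': 'essential', 'obj': 'essential'}
```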
Once the linguistic knowledge processing is applied in step 225, a reduction operation (step 230) can take place. The reduction operation can be viewed as a series of decision-making steps along the edges of a parse tree. Beginning with the root node of the parse tree, the immediate child nodes are evaluated to determine which child nodes can be removed. A child node can be removed if three conditions are satisfied. The first condition is that the component is not a local focus. To determine whether a component is a local focus, the ratio of the context importance score (step 215b) of the child node to that of the root node is calculated. The child node is then considered unimportant if the calculated ratio is smaller than a threshold value. The second condition is that the corpus probability (step 220) that the particular syntactic component of the root is removed is higher than a threshold. The final condition is that the linguistic analysis in step 225 marks the child node as reducible.
When the conditions to remove a child node are satisfied, the child node is tagged as "removable" and processing on that branch of the tree terminates. For the child nodes which are retained, the lower levels of the parse tree are evaluated by repeating this process in a similar manner through the tree. The reduction operation step 230 is complete when there are no more nodes to consider. This also concludes processing of the sentence reduction module and results in the parse trees being marked with those components which can be removed or altered by the subsequent paste module 150 operation.
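Putting the three conditions together, the top-down traversal might be sketched as follows; the threshold values and node fields are illustrative choices, since the text leaves them open.

```python
# Sketch of the top-down reduction decision. A child is tagged removable only
# when (1) it is not a local focus, (2) the corpus says its syntactic role is
# often removed, and (3) the lexicon tagged it reducible; pruned branches are
# not descended further. Threshold values here are illustrative only.

CONTEXT_RATIO_THRESHOLD = 0.5
REMOVAL_PROB_THRESHOLD = 0.6

class Node:
    def __init__(self, context_score, removal_prob, tag, children=()):
        self.context_score = context_score  # step 215 context importance
        self.removal_prob = removal_prob    # step 220 corpus probability
        self.tag = tag                      # step 225: 'essential'/'reducible'
        self.children = list(children)
        self.removable = False

def mark_removable(root):
    for child in root.children:
        not_local_focus = (child.context_score
                           < CONTEXT_RATIO_THRESHOLD * root.context_score)
        corpus_removes = child.removal_prob > REMOVAL_PROB_THRESHOLD
        if not_local_focus and corpus_removes and child.tag == "reducible":
            child.removable = True      # stop: this branch will be cut
        else:
            mark_removable(child)       # keep node; evaluate its children
```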
Following processing by the sentence reduction module 135, processing by the sentence combination module 140 is performed. The operation of the sentence combination module 140 is further illustrated in the flow chart of Figure 4.
Using the sentence combination subcorpus 165b, the sentence combination module evaluates the extracted sentences to identify applicable sentence combination operations (step 410). Figure 5 is a table illustrating combination operations such as: add descriptions 510, aggregations 515, substitute incoherent phrases 520, substitute phrases with more general or more specific information 525, and mixed operations 530.
From the sentence combination subcorpus 165b, sentence combination rules are also established to determine whether and how the sentence combination operations of step 410 will take place (step 415). The result is a set of sentence combination rules 420, such as those set forth in Figure 6. The rules illustrated in Figure 6 are exemplary and non-exhaustive. These sentence combination rules 420 were determined empirically by manual inspection of the sentence combination subcorpus 165b. Using the input article 105 and the extracted sentences reduced by the sentence reduction module 135, the sentence combination module 140, in cooperation with the co-reference resolution module 190, applies the sentence combination rules 420 (step 425). The result of step 425 is that the parse trees of the sentences being combined are appropriately tagged to effect the sentence combination. The combination operation is then realized in step 430 using a tree adjoining grammar (TAG) formalism, as described by A. Joshi, "Introduction to Tree-Adjoining Grammars," in Mathematics of Language, John Benjamins, Amsterdam, 1987. In this way, the sentence combination module 140 performs a paste operation on the marked parse trees and generates a summary sentence.
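As a loose, string-level illustration only (the patent's operation works on tagged parse trees via the TAG formalism), one of the combination operations, add descriptions, might behave like this toy function; the function name and splicing logic are invented for exposition.

```python
# Toy, string-level illustration of one combination operation ("add
# descriptions"): splice a description of a shared entity into the sentence
# that mentions it. The real operation works on tagged parse trees via TAG.

def add_description(sentence, entity, description):
    """Insert an appositive description after the first mention of entity."""
    words = sentence.split()
    for i, word in enumerate(words):
        if word.strip(",.") == entity:
            spliced = words[:i + 1] + [description + ","] + words[i + 1:]
            spliced[i] = word + ","
            return " ".join(spliced)
    return sentence

print(add_description("The FCC approved the merger Tuesday.",
                      "FCC", "the federal communications regulator"))
# -> "The FCC, the federal communications regulator, approved the merger Tuesday."
```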
The document summary is generated by combining the summary sentences. The most straightforward combination is to maintain the order of the sentences as they were extracted; however, other sequencing arrangements can also be employed.
As noted above in connection with Figure 1, the corpus decomposition module 185 operates on the corpus 165 to generate the sentence reduction subcorpus 165a and the sentence combination subcorpus 165b. The decomposition module 185 generally operates to evaluate the human-written summaries in the corpus 165, compare the summary sentences to the original document, determine whether a summary sentence was generated by a cut and paste operation, and identify where the components of the summary sentences were taken from in the original documents. The operation of the decomposition module 185 is illustrated in the flow diagram of Figure 7.
Referring to Figure 7, the decomposition module 185 uses the human-generated summary and original document as inputs to an indexing operation (step 705). During indexing, each word in the original document is indexed according to its positions in the original document. A convenient way of referencing these occurrences is by sentence number and word number in the original document. To evaluate the index of words, a set of heuristic rules is developed by manual inspection of the corpus 165. Such inspection reveals that human-generated summaries often include one or more of six operations: sentence reduction, sentence combination, syntactic transformation, lexical paraphrasing, generalization/specification, and content reordering. The heuristic rules can be represented using a bigram probability $PROB(W_2 = (S_2, w_2) \mid W_1 = (S_1, w_1))$, where $S_i$ and $w_i$ denote the sentence number and word number of a summary word's source position in the original document (abbreviated as $PROB(W_2 \mid W_1)$ in the following discussion). The probability values can be assigned in the following manner:
•IF ((S1 = S2) AND (W1 = W2 − 1)) (i.e., the words are in two adjacent positions in the document), THEN PROB(W2 | W1) is assigned the maximal value, P1. (Rule: Two adjacent words in the summary are most likely to come from two adjacent words in the document.)
•IF ((S1 = S2) AND (W1 < W2 − 1)), THEN PROB(W2 | W1) is assigned the second highest value, P2. (Rule: Adjacent words in the summary are highly likely to come from the same sentence in the document, retaining their relative order, as in sentence reduction. This rule can be further refined by adding restrictions on the distance between the words.)
•IF ((S1 = S2) AND (W1 > W2)), THEN PROB(W2 | W1) is assigned the third highest value, P3. (Rule: Adjacent words in the summary are likely to come from the same sentence in the document but reverse their relative order, as in the case of sentence reduction with syntactic transformations.)
•IF (S2 − CONST < S1 < S2), THEN PROB(W2 | W1) is assigned the fourth highest value, P4. (Rule: Adjacent words in the summary can come from nearby sentences in the document and retain their relative order, as in sentence combination. CONST is a small constant such as 3 or 5.)
•IF (S2 < S1 < S2 + CONST), THEN PROB(W2 | W1) is assigned the fifth highest value, P5. (Rule: Adjacent words in the summary can come from nearby sentences in the document but reverse their relative order.)
•IF (|S2 − S1| > CONST), THEN PROB(W2 | W1) is assigned a small value, P6. (Rule: Adjacent words in the summary are not very likely to come from sentences far apart.) Based on the above heuristic principles, a Hidden Markov Model can be generated, such as is illustrated in Figure 8 (step 710). The nodes in the Hidden Markov Model represent possible positions in the document, and the edges carry the probability of moving from one node to another. This Hidden Markov Model is used to find the most likely position sequence in a subsequent processing operation.
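For concreteness, a minimal Python sketch of the indexing operation of step 705 and of a transition function encoding the six heuristic rules might look as follows. The function names, the whitespace tokenization, and the particular probability values are illustrative assumptions only; the empirical assignment of P1 through P6 is discussed next.

```python
from collections import defaultdict

# Illustrative values only; the values of P1-P6 are assigned empirically.
P1, P2, P3, P4, P5, P6 = 1.0, 0.9, 0.8, 0.7, 0.6, 0.5
CONST = 3   # small sentence-distance constant, e.g. 3 or 5

def index_document(sentences):
    """Index each word by its (sentence number, word number) positions (step 705)."""
    index = defaultdict(list)
    for s, sentence in enumerate(sentences, start=1):
        for w, word in enumerate(sentence.split(), start=1):
            index[word.lower()].append((s, w))
    return index

def transition_prob(pos1, pos2):
    """PROB(W2 | W1) under the six heuristic rules, for positions (S, W)."""
    s1, w1 = pos1
    s2, w2 = pos2
    if s1 == s2 and w1 == w2 - 1:       # adjacent words in the same sentence
        return P1
    if s1 == s2 and w1 < w2 - 1:        # same sentence, relative order retained
        return P2
    if s1 == s2 and w1 > w2:            # same sentence, relative order reversed
        return P3
    if s2 - CONST < s1 < s2:            # nearby earlier sentence, order retained
        return P4
    if s2 < s1 < s2 + CONST:            # nearby later sentence, order reversed
        return P5
    return P6                           # sentences far apart
```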
Assigning values to P1-P6 is performed empirically. For example, the maximal value can be assigned 1 and the others assigned evenly decreasing values of 0.9, 0.8, and so on. The ordering of the above rules is based on empirical observations of a particular set of summaries. These values, however, can be adjusted or even trained for different corpora.
A Viterbi algorithm can be used to evaluate the Hidden Markov Model and find the most likely sequence of words incrementally (step 715). The Viterbi algorithm first finds the most likely position sequence for (Word1, Word2), for each possible position of Word2. This information is then used to compute the most likely sequence for (Word1, Word2, Word3), for each possible position of Word3. The process repeats until all of the words in the sequence have been considered.
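Continuing the sketch above, an incremental Viterbi search over the candidate positions could be written as follows; this assumes the index and transition function sketched earlier and that every summary word occurs somewhere in the document.

```python
def most_likely_positions(summary_words, index, transition_prob):
    """Incrementally find the most likely document-position sequence for the
    summary words (step 715)."""
    first = summary_words[0].lower()
    # best[pos] = (probability, path) of the best sequence ending at pos
    best = {pos: (1.0, (pos,)) for pos in index[first]}
    for word in summary_words[1:]:
        new_best = {}
        for pos in index[word.lower()]:
            # extend the best previous path by this candidate position
            prob, path = max(
                ((p * transition_prob(prev, pos), prev_path)
                 for prev, (p, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[pos] = (prob, path + (pos,))
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]
```

For example, most_likely_positions("the court ruled".split(), index, transition_prob) would return one (sentence number, word number) pair per summary word, i.e., the position sequence used in the subsequent matching step.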
After evaluation by the Viterbi algorithm, post-editing operations can be used to cancel mismatches that occur in the corpus analysis. The result is that summary sentences are matched to the corresponding phrases in the document. Once the summary sentences are so matched, it is a simple endeavor to sort the various matchings into either the sentence reduction subcorpus 165a or the sentence combination subcorpus 165b. In addition, the decomposition module 185 can be used as a stand-alone tool, apart from the rest of the present summary generation system, to perform various summary analysis operations. Although the present invention has been described in connection with specific exemplary embodiments, it should be understood that various changes, substitutions and alterations can be made to the disclosed embodiments without departing from the spirit and scope of the invention as set forth in the appended claims.

Claims

1. A system for generating a summary of an input document comprising:
an extraction module, the extraction module receiving the input document and extracting at least one sentence related to a focus of the document;
a summary sentence generation module operatively coupled to the extraction module;
a grammatical parser operatively coupled to the generation module for parsing the extracted sentences into components in a grammatical representation;
a combined lexicon operatively coupled to the generation module; and
a corpus of human generated summaries operatively coupled to the generation module.
2. The system for generating a summary of an input document of claim 1, wherein the generation module further comprises a sentence reduction module.
3. The system for generating a summary of an input document of claim 2, wherein the sentence reduction module is cooperatively engaged with the corpus and performs probabilistic importance processing on the components of the grammatical representation in accordance with the corpus.
4. The system for generating a summary of an input document of claim 3, wherein the sentence reduction module is cooperatively engaged with the combined lexicon and performs context importance processing on the components of the grammatical representation.
5. The system for generating a summary of an input document of claim 4, wherein the context importance processing includes establishing a plurality of lexical links of at least one type for the components and generating a context importance score based on the type and number of links associated with the components.
6. The system for generating a summary of an input document of claim 5, wherein the sentence reduction module further computes the relative importance of the components based on linguistic knowledge stored in the combined lexicon.
7. The system for generating a summary of an input document of claim 1, wherein the generation module further comprises a sentence combination module.
8. The system for generating a summary of claim 7, wherein the sentence combination module is operatively coupled to the corpus and wherein the sentence combination module:
identifies at least one sentence combination operation;
establishes at least one rule for applying the sentence combination operation; and
applies the at least one rule to combine at least two extracted sentences.
9. The system for generating a summary of claim 8, wherein the at least one sentence combination operation is selected from the group consisting of add descriptions, aggregations, substitute incoherent phrases, substitute phrases with more general or more specific information, and mixed operations.
10. The system for generating a summary of claim 9, wherein the at least one rule to combine extracted sentences includes replacing a partial name phrase with a full name phrase.
11. The system for generating a summary of claim 10, wherein the at least one rule to combine extracted sentences includes determining whether two sentences having a common subject are proximate and whether at least one of the sentences is marked for reduction, and if so, removing the subject of the second sentence and combining the second sentence with the first sentence using the connective "and."
12. The system for generating a summary of an input document of claim 1, wherein the generation module further comprises a sentence reduction module and a sentence combination module.
13. The system for generating a summary of an input document of claim 12, wherein the sentence reduction module is cooperatively engaged with the combined lexicon and performs context importance processing on the components of the grammatical representation.
14. The system for generating a summary of an input document of claim 13, wherein the context importance processing includes establishing a plurality of lexical links of at least one type for the components and generating a context importance score based on the type and number of links associated with the components.
15. The system for generating a summary of an input document of claim 14, wherein the sentence reduction module further computes the relative importance of the components based on linguistic knowledge stored in the combined lexicon.
16. The system for generating a summary of an input document of claim 15, wherein the sentence reduction module is cooperatively engaged with the corpus and performs probabilistic importance processing on the components of the grammatical representation in accordance with the corpus.
17. The system for generating a summary of an input document of claim 12, wherein the sentence combination module is operatively coupled to the corpus and wherein the sentence combination module:
identifies at least one sentence combination operation;
establishes at least one rule for applying the sentence combination operation; and
applies the at least one rule to combine at least two extracted sentences.
18. The system for generating a summary of claim 17, wherein the at least one sentence combination operation is selected from the group consisting of add descriptions, aggregations, substitute incoherent phrases, substitute phrases with more general or more specific information, and mixed operations.
19. The system for generating a summary of claim 18, wherein the at least one rule to combine extracted sentences includes replacing a partial name phrase with a full name phrase.
20. The system for generating a summary of claim 19, wherein the at least one rule to combine extracted sentences includes determining whether two sentences having a common subject are proximate and whether at least one of the sentences is marked for reduction, and if so, removing the subject of the second sentence and combining the second sentence with the first sentence using the connective "and."
21. The system for generating a summary of an input document of claim 1, further comprising a decomposition module operatively coupled to the corpus, the decomposition module analyzing the corpus and generating a sentence reduction subcorpus and a sentence combination subcorpus.
22. A method of generating a summary of an input document comprising:
extracting at least one sentence from the document;
parsing the at least one sentence into components;
performing a sentence reduction operation to mark components which can be removed from the sentence;
performing a sentence combination operation to mark components of at least two sentences which can be merged; and
operating on the marked components to effect the indicated removal and combination of sentence components.
23. The method of generating a summary of claim 22, wherein the sentence reduction operation comprises:
measuring the contextual importance of the components;
measuring the probabilistic importance of the components based on a given corpus;
measuring the importance of the components based on linguistic knowledge;
synthesizing the contextual, probabilistic and knowledge based importance measures into a relative importance score for each component; and
marking those components having an importance score below a threshold value for removal.
24. The method of generating a summary of claim 23, wherein the contextual importance is measured by:
identifying a plurality of lexical links of at least one type among the components in a local context in the document; and
computing a context importance score according to the type and number of lexical links associated with each component.
25. The method of generating a summary of claim 24, wherein the at least one type of lexical link is selected from the group consisting of repetition, inflectional variants, derivational variants, synonyms, hypernyms, antonyms, part-of, entailment, and causative links.
26. The method of generating a summary of claim 23, wherein the probabilistic importance score is determined based on a corpus of human-written summaries.
27. The method of generating a summary of claim 23, wherein the linguistic knowledge operation includes the use of a combined lexicon.
28. The method of generating a summary of claim 22, wherein the sentence combination operation further comprises:
identifying at least one sentence combination operation;
establishing at least one rule for applying the sentence combination operation; and
applying the at least one rule to combine at least two extracted sentences.
29. The method of generating a summary of claim 28, wherein the at least one sentence combination operation is selected from the group consisting of add descriptions, aggregations, substitute incoherent phrases, substitute phrases with more general or more specific information, and mixed operations.
30. The method of generating a summary of claim 28, wherein the at least one rule to combine extracted sentences includes replacing a partial name phrase with a full name phrase.
31. The method of generating a summary of claim 28, wherein the at least one rule to combine extracted sentences includes determining whether two sentences having a common subject are proximate and whether at least one of the sentences is marked for reduction, and if so, removing the subject of the second sentence and combining the second sentence with the first sentence using the connective "and."
32. A method of identifying correspondence between phrases in a sentence in a summary and phrases in the original document corresponding to the summary comprising:
establishing a plurality of heuristic rules for identifying a cut and paste summarization operation;
building a probability model based on the heuristic rules; and
calculating the best solution of the probability model to map a correspondence between the summary phrases and the original phrases.
33. The method of claim 32, wherein the probability model is a Hidden Markov Model.
34. The method of claim 33, wherein a Viterbi algorithm is employed to calculate the best solution.
35. A corpus for a summarization system comprising:
a plurality of documents;
a plurality of human generated summaries associated with the plurality of documents;
a sentence combination subcorpus; and
a sentence reduction subcorpus.
36. The corpus of claim 35, wherein the sentence combination subcorpus includes at least one mapping between a summary sentence and at least two original sentences containing phrases in the summary sentence.
37. The corpus of claim 35, wherein the sentence reduction subcorpus includes at least one sentence pair, each sentence pair having a summary sentence and a corresponding original sentence.

Priority Applications (5)

Application Number Priority Date Filing Date Title
EP00915831A EP1208455A4 (en) 1999-02-19 2000-02-22 Cut and paste document summarization system and method
AU37038/00A AU778394B2 (en) 1999-02-19 2000-02-22 Cut and paste document summarization system and method
CA002363834A CA2363834A1 (en) 1999-02-19 2000-02-22 Cut and paste document summarization system and method
IL14495000A IL144950A0 (en) 1999-02-19 2000-02-22 Cut and paste document summarization system and method
HK02108117.9A HK1046570A1 (en) 1999-02-19 2002-11-08 Cut and paste document summarization system and method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US12065799P 1999-02-19 1999-02-19
US60/120,657 1999-02-19

Publications (1)

Publication Number Publication Date
WO2001006408A1 (en) 2001-01-25

Family

ID=22391719

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/004505 WO2001006408A1 (en) 1999-02-19 2000-02-22 Cut and paste document summarization system and method

Country Status (6)

Country Link
EP (1) EP1208455A4 (en)
AU (1) AU778394B2 (en)
CA (1) CA2363834A1 (en)
HK (1) HK1046570A1 (en)
IL (1) IL144950A0 (en)
WO (1) WO2001006408A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460150A (en) * 2018-03-23 2018-08-28 北京奇虎科技有限公司 The processing method and processing device of headline
CN116501862B (en) * 2023-06-25 2023-09-12 桂林电子科技大学 Automatic text extraction system based on dynamic distributed collection


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5077668A (en) * 1988-09-30 1991-12-31 Kabushiki Kaisha Toshiba Method and apparatus for producing an abstract of a document
US5778397A (en) * 1995-06-28 1998-07-07 Xerox Corporation Automatic method of generating feature probabilities for automatic extracting summarization
US5918240A (en) * 1995-06-28 1999-06-29 Xerox Corporation Automatic method of extracting summarization using feature probabilities
US5838323A (en) * 1995-09-29 1998-11-17 Apple Computer, Inc. Document summary computer system user interface
US5924108A (en) * 1996-03-29 1999-07-13 Microsoft Corporation Document summarizer for word processors

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP1208455A4 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020289A (en) * 2012-12-25 2013-04-03 浙江鸿程计算机系统有限公司 Method for providing individual needs of search engine user based on log mining
EP3062238A1 (en) * 2015-02-27 2016-08-31 Samsung Electronics Co., Ltd. Summarization by sentence extraction and translation of summaries containing named entities
CN117591666A (en) * 2024-01-18 2024-02-23 交通运输部公路科学研究所 Abstract extraction method for bridge management and maintenance document
CN117591666B (en) * 2024-01-18 2024-05-10 交通运输部公路科学研究所 Abstract extraction method for bridge management and maintenance document

Also Published As

Publication number Publication date
EP1208455A1 (en) 2002-05-29
IL144950A0 (en) 2002-06-30
AU778394B2 (en) 2004-12-02
EP1208455A4 (en) 2006-08-09
AU3703800A (en) 2001-02-05
CA2363834A1 (en) 2001-01-25
HK1046570A1 (en) 2003-01-17


Legal Events

Date Code Title Description
AK Designated states: kind code of ref document: A1; designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW
AL Designated countries for regional patents: kind code of ref document: A1; designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase; ref document number: 144950; country of ref document: IL
ENP Entry into the national phase; ref document number: 2363834; country of ref document: CA; kind code of ref document: A
WWE Wipo information: entry into national phase; ref document number: IN/PCT/2001/00736/DE; country of ref document: IN
WWE Wipo information: entry into national phase; ref document number: 37038/00; country of ref document: AU
WWE Wipo information: entry into national phase; ref document number: 2000915831; country of ref document: EP
REG Reference to national code; ref country code: DE; ref legal event code: 8642
WWE Wipo information: entry into national phase; ref document number: 09913746; country of ref document: US
WWP Wipo information: published in national office; ref document number: 2000915831; country of ref document: EP
WWG Wipo information: grant in national office; ref document number: 37038/00; country of ref document: AU
WWW Wipo information: withdrawn in national office; ref document number: 2000915831; country of ref document: EP