EP0968478A1 - Method for automatically generating a summarized text by a computer

Method for automatically generating a summarized text by a computer

Publication number
EP0968478A1
Authority
EP
Grant status
Application
Patent type
Prior art keywords
text
summary
sentence
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP19980914784
Other languages
German (de)
French (fr)
Inventor
Thomas BRÜCKNER
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens AG
Original Assignee
Siemens AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 Handling natural language data
    • G06F17/27 Automatic analysis, e.g. parsing
    • G06F17/2785 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 Handling natural language data
    • G06F17/27 Automatic analysis, e.g. parsing
    • G06F17/2705 Parsing
    • G06F17/2715 Statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 Handling natural language data
    • G06F17/27 Automatic analysis, e.g. parsing
    • G06F17/2765 Recognition
    • G06F17/277 Lexical analysis, e.g. tokenisation, collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 Handling natural language data
    • G06F17/27 Automatic analysis, e.g. parsing
    • G06F17/2785 Semantic analysis
    • G06F17/279 Discourse representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30 Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/3061 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F17/30716 Browsing or visualization
    • G06F17/30719 Summarization for human users
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99934 Query formulation, input preparation, or translation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99935 Query augmenting and refining, e.g. inexact access

Abstract

The inventive method enables sentence-based automatic summarization of a text on a computer. Subject-related lexica are used which provide a measure of relevance for every word they contain. Each sentence of the text to be summarized is processed word by word, and the frequency of each individual word is computed and weighted with the measure of relevance. To produce the summary, the n sentences with the highest probability of belonging to the summary are compiled, where n is given by a predefinable reduction measure.

Description

A method for automatically generating a summary of a text by a computer

The invention relates to a method for automatically generating a summary of a text by a computer.

A method for the automatic summarization of a text is known from [2]. In that method, feature probabilities are determined that enable automatic summarization.

Nowadays it is difficult and sometimes tedious to select, from a flood of information, the information that is important according to predefined personal criteria. Even after such a selection, almost inexhaustible masses of data, for example in the form of articles, often remain. Since computers make it easy to capture and manage large amounts of data, it is natural to use the computer for processing or selecting information as well. Such automatic information reduction is intended to let a user read a significantly smaller amount of data in order to reach the information relevant to him.

A special kind of information reduction is the summarization of texts.

A method for summarizing text is known from [1] which uses heuristic features with a discrete range of values. The probability that a sentence of the text belongs to the summary, given that a heuristic feature has a particular value, is estimated from a training set of summaries. The object of the invention is to generate automatically a summary of a given text, the summary reflecting the essential content of the text in brief form.

This object is achieved according to the features of claim 1.

The method according to the invention summarizes a text by determining, for each sentence of the text, a probability that the sentence belongs to the summary. For this purpose, the relevance measure of each word in the sentence is determined from a lexicon that contains all relevant words together with a predetermined relevance measure for each of them. Accumulating all relevance measures yields the probability that the sentence belongs to the summary. All sentences are then sorted according to their probability. A predefinable reduction measure, which indicates what percentage of the original text is to be represented in the summary, is used to select the corresponding number of sentences from the sorted representation. Once the most important x percent of the sentences have been selected, they are displayed as the summary in the order given by the original text.
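
The basic variant described above (relevance measures only, without word frequencies) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the sample lexicon values, and the choice to round the number of kept sentences upward are assumptions made for the example.

```python
import math

def summarize_by_relevance(sentences, lexicon, reduction):
    """Return the most relevant sentences in their original order.

    sentences: list of sentences, each a list of words
    lexicon:   dict word -> relevance measure (hypothetical values)
    reduction: fraction of sentences to keep, e.g. 0.5 for 50 percent
    """
    # Accumulate the relevance measures of all words of each sentence;
    # words that are not in the lexicon contribute nothing.
    scores = [sum(lexicon.get(w, 0.0) for w in s) for s in sentences]
    # Number of sentences to keep; rounding up is an assumption.
    keep = max(1, math.ceil(reduction * len(sentences)))
    # Sort sentence indices by descending score and take the best ones.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:keep])  # restore the order of the original text
    return [sentences[i] for i in chosen]

sentences = [
    ["the", "team", "won", "the", "match"],
    ["it", "rained", "all", "day"],
    ["the", "striker", "scored", "a", "goal"],
    ["fans", "went", "home"],
]
sports_lexicon = {"team": 0.6, "match": 0.8, "striker": 0.9, "goal": 1.0, "fans": 0.3}
summary = summarize_by_relevance(sentences, sports_lexicon, 0.5)
```

With a reduction measure of 50 percent, the first and third sentences are kept, and they appear in text order rather than score order.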

An advantageous development of the inventive method is to introduce a single-word frequency in addition to the relevance measure. This single-word frequency indicates how often the word under consideration occurs in the entire text to be summarized. Taking into account the relevance measure and this newly introduced single-word frequency, the probability that the respective sentence is contained in the summary is given by the following rule:

$$WK(\text{sentence}) = \frac{1}{N} \sum_{i=1}^{N} tf_i \cdot rlv_i$$

where WK(sentence) denotes the probability that the sentence belongs to the summary, N the total number of words occurring in the sentence, i a count variable (i = 1, 2, ..., N) over all words in the sentence, tf the frequency of occurrence of the respective word in the entire text to be summarized (single-word frequency), and rlv the relevance measure of the respective word in the sentence.

It should be noted here that only words occurring in the lexicon, with their relevance rlv known from the lexicon, are decisive. If a word that is not in the lexicon occurs n times, it does not increase the probability that the sentence belongs to the summary.
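
A small worked example may help here. It assumes the rule has the form WK(sentence) = (1/N) Σ tf_i · rlv_i, which is what the later description of accumulating word contributions and normalizing by the number of words suggests; the function name and the sample values are hypothetical.

```python
from collections import Counter

def sentence_probability(sentence, tf, lexicon):
    """Assumed rule: WK(sentence) = (1/N) * sum over words of tf * rlv."""
    n = len(sentence)
    if n == 0:
        return 0.0
    total = 0.0
    for word in sentence:
        rlv = lexicon.get(word, 0.0)  # words outside the lexicon add nothing
        total += tf[word] * rlv
    return total / n

text = [["goal", "goal", "weather"], ["weather", "weather", "weather"]]
# Single-word frequency: occurrences of each word in the whole text.
tf = Counter(w for s in text for w in s)
lexicon = {"goal": 0.5}  # hypothetical relevance measure

wk_first = sentence_probability(text[0], tf, lexicon)   # (2*0.5 + 2*0.5 + 0) / 3
wk_second = sentence_probability(text[1], tf, lexicon)  # only non-lexicon words
```

The word "weather" occurs four times in the text but is not in the lexicon, so the second sentence scores zero however often the word is repeated.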

A development of the method according to the invention is to use an application-specific lexicon. As a result, the summary is produced with a predefined subject-specific filter. For example, a lexicon specialized in sports articles rates sport-related words in a text to be summarized with higher relevance than a lexicon specialized in summarizing business articles. Suitable lexica can thus advantageously be provided for definable categories, each containing the specific knowledge of its category.

Furthermore, it is advantageous to assign a text to one or more categories. This can be done automatically by using predetermined special words in the topic-related lexica as a selection criterion for assignment to the respective subject areas. If several categories (topics), and hence different viewpoints or filters, are possible for the summary of a text, a different summary can be created automatically for each category.

The invention is explained further with reference to an exemplary embodiment illustrated in the figures.

The figures show:

Fig. 1 is a diagram illustrating a system for automatically generating a summary

Fig. 2 is a block diagram illustrating the steps of the inventive method.

Fig. 1 shows a system with which the automatic generation of a summary of a text is performed by a computer. A text to be summarized, for example the result of a database query, is present either in written form TXT, for example on paper, or in digital form DIGTXT.

To process the text on paper TXT according to the invention, it must be made accessible to the computer. For this purpose, the text TXT is read in by the scanner SC and stored as an image file BD. Text recognition software OCR converts the text TXT present as image file BD into a machine-readable format, for example ASCII. The text in digital form DIGTXT is already in a machine-readable format.

Furthermore, a predetermined number of topic-related lexica is kept available, one lexicon per topic. In Fig. 1, the topic-related lexica are indicated as blocks LEX1, LEX2 and LEX3. Many ways of building the contents of the topic-related lexica are conceivable. One possibility is to analyze categorized texts automatically, with word frequencies being selected as the significant criterion for the respective category.
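
One conceivable way to derive such lexica from categorized texts is sketched below. The text only states that word frequencies serve as the criterion; the concrete relevance measure used here (the share of a word's occurrences that fall into a category) and all names are assumptions for illustration.

```python
from collections import Counter

def build_lexica(categorized_texts):
    """categorized_texts: dict category -> list of words from texts of that category."""
    counts = {cat: Counter(words) for cat, words in categorized_texts.items()}
    # Word frequencies across all categories together.
    overall = Counter()
    for c in counts.values():
        overall.update(c)
    lexica = {}
    for cat, c in counts.items():
        # Assumed relevance: fraction of a word's occurrences in this category.
        lexica[cat] = {w: c[w] / overall[w] for w in c}
    return lexica

corpus = {
    "sport": ["goal", "match", "goal", "market"],
    "economy": ["market", "market", "shares", "market"],
}
lexica = build_lexica(corpus)
```

In this toy corpus, "market" receives a relevance of 0.25 in the sport lexicon and 0.75 in the economy lexicon, while "goal" is fully specific to sport.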

Based on the lexica, the text to be summarized can be categorized automatically (in block KatSel): predetermined words in the topic-related lexica, if they are present in the text to be summarized, tip the balance in favor of a summary with respect to the topic-related lexicon concerned. In such a case, a topic-related summary matching that lexicon is created.
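
The category selection can be illustrated with a short sketch. The threshold `min_hits`, the function name, and the lexicon contents are hypothetical; the text does not specify how many lexicon words must be present for a category to be chosen.

```python
def select_categories(words, lexica, min_hits=1):
    """Return the categories whose lexicon words occur in the text.

    words:    list of words of the text to be summarized
    lexica:   dict category -> dict word -> relevance measure
    min_hits: assumed minimum number of distinct lexicon words required
    """
    present = set(words)
    selected = []
    for category, lexicon in lexica.items():
        hits = len(present.intersection(lexicon))  # distinct lexicon words found
        if hits >= min_hits:
            selected.append(category)
    return selected

lexica = {
    "sport": {"match": 0.8, "goal": 1.0},
    "economy": {"shares": 0.9, "market": 0.7},
    "politics": {"election": 1.0},
}
text_words = ["the", "club", "sold", "shares", "after", "the", "match"]
categories = select_categories(text_words, lexica)
```

Here the text matches both the sport and the economy lexicon, so two topic-specific summaries would be generated, one per selected lexicon.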

It should be noted that the words in the text to be summarized are advantageously reduced to their respective base form (this is done in block LEM), and each word is given a reference to its part of speech (block TAG).

For each category (topic), the summary is created according to the invention (in block KatSel) by means of the corresponding lexicon. This results in the topic-specific summaries ZFS1 and ZFS2.

The steps that lead to the summary of the text are shown in detail in Fig. 2. For clarity, the abbreviations used in Fig. 2 are summarized below:

SZ: sentence

WK(SZ): probability of the sentence SZ

W: word

tf(W): single-word frequency of the word W (in the sentence SZ)

rlv(W): relevance of the word W (in the sentence SZ)

In step 2a, at the start of the method, the first sentence is selected and the probability that this sentence belongs to the summary is set to 0. In step 2b, the first word of this sentence is selected. Since the probability that the sentence belongs to the summary is composed of the probabilities of the individual words, the respective probability of each word in the sentence is accumulated, in the loop from step 2c to step 2e, into the overall probability for the whole sentence. Once all words in the sentence have been processed, the probability of the sentence is normalized by the number of words. The steps described are performed for all sentences in the text (steps 2g, 2h, 2i). Once the last sentence in the text has been processed, the sentences are sorted by their probability (step 2j). In step 2k, the n best sentences are selected according to a predefined reduction measure, and in step 2m they are displayed in their original order.
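
The sequence of steps in Fig. 2 can be sketched as a single loop. The step labels in the comments follow the description above; the function name, the sample data, and the rounding of the number of selected sentences are assumptions, and the sketch is not taken from the patent itself.

```python
import math
from collections import Counter

def summarize(sentences, lexicon, reduction):
    """sentences: list of word lists; lexicon: word -> rlv; reduction: fraction kept."""
    # Single-word frequency tf(W), counted over the entire text.
    tf = Counter(w for sz in sentences for w in sz)
    wk = []
    for sz in sentences:                      # steps 2g-2i: visit every sentence
        p = 0.0                               # step 2a: WK(SZ) starts at 0
        for w in sz:                          # steps 2b-2e: loop over the words
            p += tf[w] * lexicon.get(w, 0.0)  # accumulate tf(W) * rlv(W)
        wk.append(p / len(sz) if sz else 0.0)  # normalize by the number of words
    # Step 2j: sort sentence indices by descending probability.
    order = sorted(range(len(sentences)), key=lambda i: wk[i], reverse=True)
    # Step 2k: pick the n best sentences (rounding up is an assumption).
    n = max(1, math.ceil(reduction * len(sentences)))
    # Step 2m: output the selected sentences in their original order.
    return [sentences[i] for i in sorted(order[:n])]

sentences = [
    ["goal", "match"],
    ["rain", "rain", "rain"],
    ["goal", "goal", "fans"],
    ["fans", "left"],
]
lexicon = {"goal": 1.0, "match": 0.5, "fans": 0.2}
result = summarize(sentences, lexicon, 0.5)
```

In this example the third sentence has the highest probability and the first the second highest, yet the output lists the first sentence before the third because the original order is restored at the end.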

Bibliography:

[1] J. Kupiec, J. Pedersen and F. Chen, "A Trainable Document Summarizer", Xerox Palo Alto Research Center, 1995.

[2] EP 0751470 A1

Claims

1. A method for automatically generating a summary of a text by a computer,
a) in which, for each sentence, a probability that the sentence belongs to the summary is determined by determining, for each word in the sentence, the relevance measure from a lexicon that contains application-specific words with a predetermined relevance measure for each of these words, all relevance measures cumulatively yielding the probability that the sentence belongs to the summary,
b) in which all sentences of the text are sorted according to the probabilities,
c) in which, according to a predefinable reduction measure, the best sentences are displayed as the summary in the order given by the text.
2. The method of claim 1, in which, in addition to the relevance measure, a single-word frequency is determined for each word, and the probability that the respective sentence is contained in the summary is determined by the following rule:
$$WK(\text{sentence}) = \frac{1}{N} \sum_{i=1}^{N} tf_i \cdot rlv_i$$
where WK(sentence) denotes the probability that the sentence belongs to the summary, N the total number of words occurring in the sentence, i a count variable (i = 1, 2, ..., N) over all words in the sentence, tf the frequency of occurrence of the respective word in the entire text to be summarized (single-word frequency), and rlv the relevance measure of the respective word in the sentence.
3. The method of claim 1 or 2, in which the text is assigned to one or more categories, an application-specific lexicon being used for each category.
4. The method according to one of claims 1 to 3, in which an application-specific summary is created for each assignment of the text to a category.
EP19980914784 1997-03-18 1998-02-18 Method for automatically generating a summarized text by a computer Withdrawn EP0968478A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
DE19711284 1997-03-18
PCT/DE1998/000485 WO1998041930A1 (en) 1997-03-18 1998-02-18 Method for automatically generating a summarized text by a computer

Publications (1)

Publication Number Publication Date
EP0968478A1 (en) 2000-01-05

Family

ID=7823794

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19980914784 Withdrawn EP0968478A1 (en) 1997-03-18 1998-02-18 Method for automatically generating a summarized text by a computer

Country Status (4)

Country Link
US (1) US6401086B1 (en)
EP (1) EP0968478A1 (en)
JP (1) JP2001515623A (en)
WO (1) WO1998041930A1 (en)

Families Citing this family (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6789230B2 (en) * 1998-10-09 2004-09-07 Microsoft Corporation Creating a summary having sentences with the highest weight, and lowest length
US7475334B1 (en) * 2000-01-19 2009-01-06 Alcatel-Lucent Usa Inc. Method and system for abstracting electronic documents
EP1380170A2 (en) * 2000-11-14 2004-01-14 Philips Electronics N.V. Summarization and/or indexing of programs
WO2003012661A1 (en) * 2001-07-31 2003-02-13 Invention Machine Corporation Computer based summarization of natural language documents
US8799776B2 (en) * 2001-07-31 2014-08-05 Invention Machine Corporation Semantic processor for recognition of whole-part relations in natural language documents
US9009590B2 (en) * 2001-07-31 2015-04-14 Invention Machines Corporation Semantic processor for recognition of cause-effect relations in natural language documents
US6904564B1 (en) * 2002-01-14 2005-06-07 The United States Of America As Represented By The National Security Agency Method of summarizing text using just the text
US7650562B2 (en) * 2002-02-21 2010-01-19 Xerox Corporation Methods and systems for incrementally changing text representation
US7549114B2 (en) * 2002-02-21 2009-06-16 Xerox Corporation Methods and systems for incrementally changing text representation
US20040199408A1 (en) * 2003-04-01 2004-10-07 Johnson Tolbert R. Medical information card
US9275052B2 (en) 2005-01-19 2016-03-01 Amazon Technologies, Inc. Providing annotations of a digital work
US8131647B2 (en) 2005-01-19 2012-03-06 Amazon Technologies, Inc. Method and system for providing annotations of a digital work
US8234279B2 (en) * 2005-10-11 2012-07-31 The Boeing Company Streaming text data mining method and apparatus using multidimensional subspaces
US7752204B2 (en) * 2005-11-18 2010-07-06 The Boeing Company Query-based text summarization
US7831597B2 (en) * 2005-11-18 2010-11-09 The Boeing Company Text summarization method and apparatus using a multidimensional subspace
US8352449B1 (en) 2006-03-29 2013-01-08 Amazon Technologies, Inc. Reader device content indexing
US20080005284A1 (en) * 2006-06-29 2008-01-03 The Trustees Of The University Of Pennsylvania Method and Apparatus For Publishing Textual Information To A Web Page
US8725565B1 (en) 2006-09-29 2014-05-13 Amazon Technologies, Inc. Expedited acquisition of a digital item following a sample presentation of the item
US9672533B1 (en) 2006-09-29 2017-06-06 Amazon Technologies, Inc. Acquisition of an item based on a catalog presentation of items
US7865817B2 (en) 2006-12-29 2011-01-04 Amazon Technologies, Inc. Invariant referencing in digital works
US7751807B2 (en) 2007-02-12 2010-07-06 Oomble, Inc. Method and system for a hosted mobile management service architecture
US9031947B2 (en) * 2007-03-27 2015-05-12 Invention Machine Corporation System and method for model element identification
US9665529B1 (en) 2007-03-29 2017-05-30 Amazon Technologies, Inc. Relative progress and event indicators
US7716224B2 (en) 2007-03-29 2010-05-11 Amazon Technologies, Inc. Search and indexing on a user device
US20080288488A1 (en) * 2007-05-15 2008-11-20 Iprm Intellectual Property Rights Management Ag C/O Dr. Hans Durrer Method and system for determining trend potentials
US8234282B2 (en) 2007-05-21 2012-07-31 Amazon Technologies, Inc. Managing status of search index generation
US20080301579A1 (en) * 2007-06-04 2008-12-04 Yahoo! Inc. Interactive interface for navigating, previewing, and accessing multimedia content
US8024400B2 (en) 2007-09-26 2011-09-20 Oomble, Inc. Method and system for transferring content from the web to mobile devices
US9087032B1 (en) * 2009-01-26 2015-07-21 Amazon Technologies, Inc. Aggregation of highlights
US8378979B2 (en) 2009-01-27 2013-02-19 Amazon Technologies, Inc. Electronic device with haptic feedback
CN102439590A (en) * 2009-03-13 2012-05-02 发明机器公司 System and method for automatic semantic labeling of natural language texts
EP2406739A2 (en) * 2009-03-13 2012-01-18 Invention Machine Corporation System and method for knowledge research
US8832584B1 (en) 2009-03-31 2014-09-09 Amazon Technologies, Inc. Questions on highlighted passages
US8692763B1 (en) 2009-09-28 2014-04-08 John T. Kim Last screen rendering for electronic book reader
US9495322B1 (en) 2010-09-21 2016-11-15 Amazon Technologies, Inc. Cover display
US9842168B2 (en) 2011-03-31 2017-12-12 Microsoft Technology Licensing, Llc Task driven user intents
US9858343B2 (en) 2011-03-31 2018-01-02 Microsoft Technology Licensing Llc Personalization of queries, conversations, and searches
US9760566B2 (en) 2011-03-31 2017-09-12 Microsoft Technology Licensing, Llc Augmented conversational understanding agent to identify conversation context between two humans and taking an agent action thereof
US9298287B2 (en) 2011-03-31 2016-03-29 Microsoft Technology Licensing, Llc Combined activation for natural user interface systems
US9244984B2 (en) 2011-03-31 2016-01-26 Microsoft Technology Licensing, Llc Location based conversational understanding
US9454962B2 (en) * 2011-05-12 2016-09-27 Microsoft Technology Licensing, Llc Sentence simplification for spoken language understanding
US9158741B1 (en) 2011-10-28 2015-10-13 Amazon Technologies, Inc. Indicators for navigating digital works
US9064006B2 (en) 2012-08-23 2015-06-23 Microsoft Technology Licensing, Llc Translating natural language utterances to keyword search queries

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4965763A (en) * 1987-03-03 1990-10-23 International Business Machines Corporation Computer method for automatic extraction of commonly specified information from business correspondence
US4930077A (en) * 1987-04-06 1990-05-29 Fan David P Information processing expert system for text analysis and predicting public opinion based information available to the public
JP2783558B2 (en) * 1988-09-30 1998-08-06 株式会社東芝 Summarization method and summary generator
JP2790466B2 (en) * 1988-10-18 1998-08-27 株式会社日立製作所 String search method and apparatus
JPH03278270A (en) * 1990-03-28 1991-12-09 Ricoh Co Ltd Abstract document forming device
US5317507A (en) * 1990-11-07 1994-05-31 Gallant Stephen I Method for document retrieval and for word sense disambiguation using neural networks
US5325298A (en) * 1990-11-07 1994-06-28 Hnc, Inc. Methods for generating or revising context vectors for a plurality of word stems
US5799299A (en) * 1994-09-14 1998-08-25 Kabushiki Kaisha Toshiba Data processing system, data retrieval system, data processing method and data retrieval method
JPH08305695A (en) * 1995-04-28 1996-11-22 Fujitsu Ltd Document processor
US5887120A (en) * 1995-05-31 1999-03-23 Oracle Corporation Method and apparatus for determining theme for discourse
US5778397A (en) 1995-06-28 1998-07-07 Xerox Corporation Automatic method of generating feature probabilities for automatic extracting summarization
EP0856175A4 (en) * 1995-08-16 2000-05-24 Univ Syracuse Multilingual document retrieval system and method using semantic vector matching
US6026388A (en) * 1995-08-16 2000-02-15 Textwise, Llc User interface and other enhancements for natural language information retrieval system and method
US5963940A (en) * 1995-08-16 1999-10-05 Syracuse University Natural language information retrieval system and method
US5963893A (en) * 1996-06-28 1999-10-05 Microsoft Corporation Identification of words in Japanese text by a computer system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO9841930A1 *

Also Published As

Publication number Publication date Type
WO1998041930A1 (en) 1998-09-24 application
JP2001515623A (en) 2001-09-18 application
US6401086B1 (en) 2002-06-04 grant

Similar Documents

Publication Publication Date Title
Edmundson Problems in automatic abstracting
Coltheart The MRC psycholinguistic database
Van Ham et al. Mapping text with phrase nets
Kupiec et al. A trainable document summarizer
US6263121B1 (en) Archival and retrieval of similar documents
US6778979B2 (en) System for automatically generating queries
US8239413B2 (en) System with user directed enrichment
US6456738B1 (en) Method of and system for extracting predetermined elements from input document based upon model which is adaptively modified according to variable amount in the input document
Lewis Feature selection and feature extraction for text categorization
Hirst et al. Lexical chains as representations of context for the detection and correction of malapropisms
US6847972B1 (en) Apparatus for classifying or disambiguating data
US6353840B2 (en) User-defined search template for extracting information from documents
Bekkerman et al. On feature distributional clustering for text categorization
US6928425B2 (en) System for propagating enrichment between documents
US6553373B2 (en) Method for dynamically delivering contents encapsulated with capsule overviews corresonding to the plurality of documents, resolving co-referentiality related to frequency within document, determining topic stamps for each document segments
Park et al. Hybrid text mining for finding abbreviations and their definitions
US5634084A (en) Abbreviation and acronym/initialism expansion procedures for a text to speech reader
US6389435B1 (en) Method and system for copying a freeform digital ink mark on an object to a related object
US6584470B2 (en) Multi-layered semiotic mechanism for answering natural language questions using document retrieval combined with information extraction
US5168565A (en) Document retrieval system
US20040181427A1 (en) Computer-implemented patent portfolio analysis method and apparatus
US6820075B2 (en) Document-centric system with auto-completion
US20100250547A1 (en) System for Automatically Generating Queries
Mani Advances in automatic text summarization
US4674066A (en) Textual database system using skeletonization and phonetic replacement to retrieve words matching or similar to query words

Legal Events

Date Code Title Description
17P Request for examination filed

Effective date: 19990903

AK Designated contracting states:

Kind code of ref document: A1

Designated state(s): DE FR GB

18D Deemed to be withdrawn

Effective date: 20030902