EP0968478A1

EP0968478A1 - Method for automatically generating a summarized text by a computer

Info

Publication number: EP0968478A1
Application number: EP98914784A
Authority: EP
Inventors: Thomas BRÜCKNER
Original assignee: Siemens AG
Current assignee: Siemens AG
Priority date: 1997-03-18
Filing date: 1998-02-18
Publication date: 2000-01-05
Also published as: US6401086B1; JP2001515623A; WO1998041930A1

Abstract

The inventive method enables a sentence-based automatic summary of a text by a computer. Subject-related lexica are used which provide a relevance measure for every word they contain. Each sentence of the text to be summarized is processed word by word, and the frequency of each individual word is computed and weighted with the relevance measure. To produce the summary, the n sentences with the highest probability of belonging to the summary are compiled, where n is a predefinable reduction variable.

Description

Method for the automatic generation of a summary of a text by a computer

The invention relates to a method for the automatic generation of a summary of a text by a computer.

A method for automatically summarizing a text is known from [2]. In that method, feature probabilities are determined that allow an automatic summary.

Nowadays it is difficult and sometimes tedious to select, from a flood of information, the information that is important according to specifiable personal criteria. Even after the selection, almost inexhaustible masses of data often remain, e.g. in the form of articles. Since computers make it easy to record and manage large amounts of data, it makes sense to use the computer to process or select information as well. Such an automatic information reduction is intended to let a user read a significantly smaller amount of data in order to arrive at the information relevant to him.

A special type of information reduction consists in the summarizing of texts.

A method for summarizing texts is known from [1] which uses heuristic features with a discrete range of values. The probability that a sentence from the text belongs to the summary, on the condition that a heuristic feature has a certain value, is estimated from a training set of summaries.

The object of the invention is to automatically generate a summary from a given text, which summary is intended to represent the essential contents of the text in short form.

This object is achieved according to the features of claim 1.

The method according to the invention enables a text to be summarized by determining, for each sentence of this text, a probability that the sentence belongs to the summary. For each word in the sentence, the relevance measure is determined from a lexicon which contains all relevant words together with a predefined relevance measure for each of these words. The accumulation of all relevance measures gives the probability of the sentence belonging to the summary. All sentences are then sorted according to their probability. A predeterminable reduction measure, which indicates what percentage of the original text is shown in the summary, determines how many sentences are selected from the sorted list. Once the most important x percent of the sentences have been selected, they are displayed as a summary of the text in the original order given by the text.
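The selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the names `sentence_probability` and `select_summary`, the lexicon contents, the whitespace tokenization and the rounding of the reduction measure to a sentence count are all hypothetical.

```python
import math

def sentence_probability(sentence, lexicon):
    """Accumulate the relevance measures of all words found in the lexicon."""
    words = sentence.lower().split()  # assumption: naive whitespace tokenization
    return sum(lexicon.get(w.strip(".,;:!?"), 0.0) for w in words)

def select_summary(sentences, lexicon, reduction_percent):
    """Return the highest-scoring sentences, restored to their original order."""
    # Assumption: round up and always keep at least one sentence.
    n = max(1, math.ceil(len(sentences) * reduction_percent / 100))
    scored = [(sentence_probability(s, lexicon), i) for i, s in enumerate(sentences)]
    # Sort by probability, highest first; ties fall back to text order.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    chosen = sorted(i for _, i in ranked[:n])
    return [sentences[i] for i in chosen]

# Hypothetical sports lexicon and input text.
lexicon = {"goal": 0.9, "match": 0.7, "team": 0.5}
sentences = [
    "The weather was mild.",
    "The team won the match.",
    "Tickets were cheap.",
    "A late goal decided the match.",
]
summary = select_summary(sentences, lexicon, 50)
```

With a reduction measure of 50 percent, two of the four sentences are kept. The second sentence precedes the fourth in the output even though the fourth scores higher, because the selected sentences are returned in text order.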

An advantageous further development of the method according to the invention consists in introducing an individual-word frequency in addition to the relevance measure. This individual-word frequency indicates how often the word in question appears in the entire text to be summarized. Taking into account the relevance measure and this newly introduced individual-word frequency, the probability that a sentence belongs to the summary can be specified by the following rule:

$$WK(\text{sentence}) = \frac{1}{N} \sum_{i=1}^{N} tf_i \cdot rlv_i$$

where WK(sentence) is the probability of the sentence belonging to the summary, N is the total number of words in the sentence, i is a counting variable (i = 1, 2, ..., N) over all words in the sentence, tf is the frequency of occurrence of the word in question in the entire text to be summarized (individual-word frequency), and rlv is the relevance measure for the respective word in the sentence.

It should be noted here that the words occurring in the lexicon with their relevance measure rlv known from the lexicon are decisive. If a word that does not exist in the lexicon occurs n times, this word does not increase the probability that the sentence belongs to the summary.
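A small Python sketch may make this weighting concrete. It assumes that the score of a sentence is the sum, over its words, of individual-word frequency times relevance measure, divided by the number of words in the sentence; the tokenizer, the function names and the example lexicon are hypothetical and serve only as illustration.

```python
from collections import Counter

def tokenize(text):
    """Assumption: split on whitespace and strip simple punctuation."""
    stripped = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in stripped if w]

def sentence_probability(sentence, text_counts, lexicon):
    """Average of tf * rlv over the words of the sentence.

    Words missing from the lexicon get relevance 0.0, so they add nothing,
    no matter how often they occur in the text.
    """
    words = tokenize(sentence)
    if not words:
        return 0.0
    total = sum(text_counts[w] * lexicon.get(w, 0.0) for w in words)
    return total / len(words)

# Hypothetical text and lexicon.
text = "The match was long. The team won the match."
text_counts = Counter(tokenize(text))      # individual-word frequencies
lexicon = {"match": 0.5, "team": 0.25}     # relevance measures

p = sentence_probability("The team won the match.", text_counts, lexicon)
```

Here "match" occurs twice in the text and contributes 2 * 0.5, "team" contributes 1 * 0.25, and the three remaining words contribute nothing, so the sum 1.25 is divided by the five words of the sentence. The word "the" occurs three times in the text but is not in the lexicon and therefore does not raise the score.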

A further development of the method according to the invention consists in using an application-specific lexicon. This means that the summary is produced through a predefinable subject-specific filter. For example, a lexicon designed for sports articles will give sports-related words in the text to be summarized a higher relevance than a lexicon that specializes in summaries of business articles. It is therefore advantageously possible to provide specific knowledge about predefinable categories by means of lexica corresponding to the respective categories.

It is also advantageous to assign a text to one or more categories. This can be done automatically by using specific, predefinable words in the subject-related lexica as a selection criterion for an assignment to the respective subject area. If several categories (subject areas), i.e. different perspectives or filters, are possible for the summary of a text, different summaries, one for each category, can be created automatically.
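One way to picture this automatic assignment is the following Python sketch. The text does not state how many key words must match, so the threshold `min_hits`, the function name `assign_categories` and the key-word sets are assumptions chosen purely for illustration.

```python
def assign_categories(text, category_keywords, min_hits=1):
    """Return every category whose key words occur in the text.

    Assumption: a category applies as soon as `min_hits` distinct key words
    from its lexicon appear in the text.
    """
    words = {w.strip(".,;:!?").lower() for w in text.split()}
    return [
        category
        for category, keywords in category_keywords.items()
        if len(words & keywords) >= min_hits
    ]

# Hypothetical selection words taken from three subject-related lexica.
category_keywords = {
    "sports": {"match", "goal"},
    "business": {"shares", "profit"},
    "politics": {"election"},
}

categories = assign_categories(
    "The club's shares rose after the match.", category_keywords
)
```

The example text is assigned to both "sports" and "business", so two separate summaries could then be produced, one with each of the two lexica.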

The invention is further explained on the basis of an exemplary embodiment which is shown in the figures.

The figures show the following:

Fig. 1 is a sketch illustrating a system for automatically generating a summary;

Fig. 2 is a block diagram illustrating the steps of the method according to the invention.

Fig. 1 shows a system with which a summary of a text is generated automatically by a computer. A text to be summarized can be present either in written form TXT, e.g. on paper, or in digital form DIGTXT, e.g. as the result of a database query.

In order to be able to process the text in paper form TXT in accordance with the invention, it is necessary to make it accessible to the computer. For this purpose, the text TXT is read in by the scanner SC and stored as an image file BD. Text recognition software OCR converts the text TXT, which is present as an image file BD, into a machine-readable format, e.g. ASCII. The digital text DIGTXT is already available in machine-readable format.

Furthermore, a predeterminable number of subject-related lexica, one lexicon for each subject, is kept available. In Fig. 1, the subject-related lexica are indicated as blocks LEX1, LEX2 and LEX3. There are many ways in which the contents of the subject-related lexica can be built up. One possibility is to automatically analyze categorized texts by choosing word clues as a significant criterion for the respective category.

On the basis of the lexica it is possible to automatically categorize the text to be summarized (in the KatSel block), in that predefinable words in the topic-related lexica, if they appear in the text to be summarized, are decisive for a summary in relation to the relevant topic-related lexicon. In such a case, a subject-related summary will be created that matches this lexicon.

It should be noted here that the words in the text to be summarized are advantageously reduced to their respective base form (this is done in the block LEM) and that each word is given a reference to its part of speech (block TAG).

For each category (topic), the summary according to the invention is created using the corresponding lexicon (in the KatSel block). This results in the subject-specific summaries ZFS1 and ZFS2.

The steps leading to the summary of the text are shown in detail in FIG. 2. For the sake of clarity, the abbreviations used in FIG. 2 are summarized below:

SZ sentence,

WK (SZ) probability for sentence SZ,

W word,

tf (W) individual-word frequency of the word W (in the sentence SZ), and

rlv (W) relevance measure of the word W (in the sentence SZ).

In step 2a, at the beginning of the method according to the invention, the first sentence is selected and the probability that this sentence belongs to the summary is set to 0. In step 2b, the first word of this sentence is selected. Since the probability that the sentence belongs to the summary is composed of the probabilities of the individual words, the respective probability of each word in the sentence is accumulated, in the loop from step 2c to step 2e, into the overall probability for the entire sentence. Once all the words in the sentence have been processed, the probability for the individual sentence is normalized by the number of words. The steps described are carried out for all sentences in the text (steps 2g, 2h, 2i). Once the last sentence in the text has been processed, the sentences are sorted according to their probability (step 2j). According to a predeterminable reduction measure, the n best sentences corresponding to the reduction measure are selected in step 2k and then displayed in their original order in step 2m.
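The flow of Fig. 2 can be traced with the following Python sketch. It is a loose illustration, not the flowchart itself: the mapping of code lines to step labels in the comments is approximate, the function name `summarize` is hypothetical, and sentences are assumed to arrive already split into lists of base-form words.

```python
def summarize(sentences, tf, rlv, n):
    """sentences: list of word lists; tf and rlv: dicts keyed by word;
    n: number of sentences to keep (derived from the reduction measure)."""
    wk = []
    for sz in sentences:            # steps 2a and 2g-2i: take each sentence in turn
        p = 0.0                     # step 2a: probability starts at 0
        for w in sz:                # steps 2b and 2c-2e: loop over the words
            # Words absent from the lexicon contribute nothing.
            p += tf.get(w, 0) * rlv.get(w, 0.0)
        # Normalize by the number of words in the sentence.
        wk.append(p / len(sz) if sz else 0.0)
    # Step 2j: sort sentence indices by probability, highest first.
    order = sorted(range(len(sentences)), key=lambda i: -wk[i])
    best = order[:n]                # step 2k: keep the n best sentences
    return [sentences[i] for i in sorted(best)]   # step 2m: original order

# Hypothetical input: three tokenized sentences, frequencies and relevances.
sentences = [["team", "won"], ["rain", "fell"], ["goal", "decided", "match"]]
tf = {"team": 1, "won": 1, "rain": 1, "fell": 1,
      "goal": 1, "decided": 1, "match": 2}
rlv = {"team": 0.5, "goal": 1.0, "match": 0.5}

result = summarize(sentences, tf, rlv, 2)
```

The third sentence scores highest and the first comes second, yet the first is listed before the third in the result because the display follows the order of the text.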

Bibliography:

[1] J.Kupiec, J.Pedersen and F.Chen, "A Trainable Document Summarizer", Xerox, Palo Alto Research Center, 1995.

[2] EP 0 751 470 A1

Claims

1. Method for the automatic generation of a summary of a text by a computer, a) in which, for each sentence, a probability that the sentence belongs to the summary is determined by determining, for each word in the sentence, the relevance measure from a lexicon which contains application-specific words with a predetermined relevance measure for each of these words, all relevance measures cumulatively giving the probability of the sentence belonging to the summary, b) in which all sentences of the text are sorted according to the probabilities, c) in which, according to a predeterminable reduction measure for the summary, the best sentences are displayed in the order given by the text.

2. The method as claimed in claim 1, in which, in addition to the relevance measure, an individual-word frequency is determined for each word and the probability that the respective sentence is included in the summary is determined by the following rule:

$$WK(\text{sentence}) = \frac{1}{N} \sum_{i=1}^{N} tf_i \cdot rlv_i$$

wherein WK(sentence) denotes the probability of the sentence belonging to the summary, N the total number of words occurring in the sentence, i a counting variable (i = 1, 2, ..., N) over all the words in the sentence, tf the frequency of the occurrence of the word in question in the entire text to be summarized (individual-word frequency), and rlv the relevance measure for the respective word in the sentence.

3. The method according to claim 1 or 2, wherein the text is assigned to one or more categories, for each of which an application-specific lexicon is used.

4. The method according to any one of claims 1 to 3, in which an application-specific summary is created for each assignment of the text to a category.