JPH03191475A

JPH03191475A - Document summarizing system

Info

Publication number: JPH03191475A
Application number: JP1332091A
Authority: JP
Inventors: Takeshi Nishimura; 健士西村
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1989-12-20
Filing date: 1989-12-20
Publication date: 1991-08-21

Abstract

PURPOSE: To advance the automation of summary generation by extracting sentences that are important in a sense that depends on the type of each paragraph, and by associating a summary sentence generation rule with each important sentence extraction rule in the summary generation process. CONSTITUTION: An unprocessed paragraph is cut out of the document, and its type is determined from the wording of its heading. Important sentences are then extracted from the paragraph using two rule groups: an important sentence extraction rule group that depends on the determined paragraph type, and one that is applicable to every type. The extracted sentences are stored together with the paragraph number and heading. This processing is repeated for all paragraphs in the document, and finally the paragraph numbers, the paragraph headings, and the extracted sentences are displayed in order of paragraph number.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a document summarization method that uses a computer to generate a summary of a document.

[Conventional technology]

A known technique for summarizing the contents of a document is described in Iwabuchi et al., "Extraction and Processing of Semantic Information from Full-Text Information" (Proceedings of the 38th National Convention of the Information Processing Society of Japan, 1989, p. 222). That technique cuts important words out of a document using information such as character types and particles, determines the most important words from the frequency with which the important words occur, and decides which sentences are important according to whether they contain the important words and the most important words. When generating the summary, it also automates the removal of special phrases such as "above" and "here" from the extracted important sentences.
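The frequency-based prior art described above can be illustrated loosely as follows. This is only a sketch for English text under stated assumptions: the stop-word list, the prefix list, and both function names are hypothetical, and the cited work relies on Japanese character types and particles rather than stop words. The criterion used here (a sentence is kept if it contains the single most frequent content word) is also a simplification, since this document does not give the exact criterion.

```python
import re
from collections import Counter

# Hypothetical stop words; the cited work instead uses Japanese character
# types and particles to decide which words are candidates.
STOP_WORDS = {"the", "a", "of", "is", "in", "we", "and", "to", "as", "here"}
# Hypothetical phrases to strip from extracted sentences.
DEICTIC_PREFIXES = ("As described above, ", "Here, ")


def most_important_word(sentences):
    """Return the most frequent non-stop word, or None if there is none."""
    words = [w for s in sentences
             for w in re.findall(r"[a-z]+", s.lower())
             if w not in STOP_WORDS]
    return Counter(words).most_common(1)[0][0] if words else None


def important_sentences(sentences):
    """Keep sentences containing the most important word, minus deictic prefixes."""
    key = most_important_word(sentences)
    kept = []
    for s in sentences:
        if key is None or key not in re.findall(r"[a-z]+", s.lower()):
            continue
        for prefix in DEICTIC_PREFIXES:
            if s.startswith(prefix) and len(s) > len(prefix):
                rest = s[len(prefix):]
                s = rest[0].upper() + rest[1:]
        kept.append(s)
    return kept
```

Note that the decision is made one sentence at a time, with no knowledge of which part of the document the sentence comes from.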

[Problem to be solved by the invention]

The technique above judges importance one sentence at a time. In real documents, however, one coherent part carries importance relative to the other coherent parts; a paragraph is an example of such a part. Besides checking whether important words occur in a sentence, it is therefore necessary to consider how important the paragraph containing that sentence is to the document as a whole.

Moreover, how accurately a word or phrase represents the content of the whole document does not necessarily match how frequently it occurs in the whole document.

Furthermore, a word or phrase can play various syntactic and semantic roles depending on its position in a sentence, so it is desirable to perform a more detailed analysis rather than simply checking whether it occurs in the sentence.

[Means to solve the problem]

The document summarization method of the present invention divides a given document into headed paragraphs, selects for each paragraph an important sentence extraction rule group corresponding to the type of that paragraph, extracts the important sentences of each paragraph by referring both to the selected rule group and to an important sentence extraction rule group that does not depend on paragraph type, and determines the set of pairs of each paragraph's heading and its extracted important sentences to be the important part of the whole document, which it outputs.

The document summarization method of the present invention may also apply the summary sentence generation rule associated in advance with each important sentence extraction rule, so that the important sentences are transformed, concatenated, and output.

Furthermore, the document summarization method of the present invention may refer to the importance set in advance for each important sentence extraction rule, select only the important sentences that exceed a predefined boundary value, and generate the summary from them.

[Effect]

In the present invention, a document is divided into paragraphs, and important sentences are extracted in a way that reflects the nature of each paragraph. Two kinds of rule group are referred to in doing so. Importance information common to the whole document, which falls within the scope of the prior art, is stored in the important sentence extraction rule group that does not depend on paragraph type, while rules that depend on the paragraph type are stored as separate important sentence extraction rule groups. There is one rule group of the former kind, and as many of the latter kind as there are paragraph types.

For example, suppose a document consists of the following paragraphs:

1. Introduction
2. Purpose of the research
3. Experimental apparatus
4. Experimental results
5. Discussion
6. References

From the heading of the first paragraph, "Introduction", the "introduction rule group" is selected for processing that paragraph. This rule group contains rules that extract sentences about the subject of the document; for example, a sentence of the form "This paper describes ..." is extracted. Similarly, the "opinion description rule group" is selected for the fifth paragraph, and the "non-important part rule group" for the sixth.

The former strengthens the extraction of sentences containing expressions of the author's claims and opinions, while the latter contains rules that suppress extraction as far as possible. In this way, the method can extract important sentences that reflect the characteristics of the paragraph containing each sentence.
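The heading-driven choice of rule group in this example can be sketched as follows. The three rule group names come from the text; the regular expressions, the lookup table, and the function names are illustrative assumptions, and headings the text does not assign to a group simply get no type-specific rules here.

```python
import re

# Rule groups named in the text; the patterns themselves are assumptions.
RULE_GROUPS = {
    "introduction": [re.compile(r"\bThis paper (describes|discusses)\b")],
    "opinion": [re.compile(r"\bwe (conclude|believe|argue)\b", re.IGNORECASE)],
    "non_important": [],  # this sketch simply extracts nothing for such paragraphs
}

# Only the headings that the text assigns explicitly.
HEADING_TO_GROUP = {
    "introduction": "introduction",
    "discussion": "opinion",
    "references": "non_important",
}


def select_rule_group(heading):
    """Return (group name, rules) for a heading, or (None, []) if unknown."""
    name = HEADING_TO_GROUP.get(heading.strip().lower())
    return name, RULE_GROUPS.get(name, [])


def extract(heading, sentences):
    """Return the sentences matched by any rule of the heading's rule group."""
    _, rules = select_rule_group(heading)
    return [s for s in sentences if any(r.search(s) for r in rules)]
```

With these tables, `extract("Introduction", ...)` keeps a sentence such as "This paper describes a summarizer." and `extract("References", ...)` keeps nothing. The type-independent rule group is left out of this sketch.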

Moreover, because each rule is specific to a paragraph type, each rule can record the importance of the sentences it extracts. In the example above, the rules in the "opinion description rule group" are on average more important than those in the "introduction rule group", and a sentence containing a numerical expression in the fourth paragraph can be judged more important than a sentence containing a numerical expression in any other paragraph.
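One way to picture per-rule importance is shown below. The numeric scale, the concrete values, the patterns, and the names `ExtractionRule`, `RULES`, and `score` are all assumptions for illustration; the text fixes only the relative ordering (opinion rules above introduction rules, and numerical expressions weighted higher in the results paragraph than elsewhere).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionRule:
    pattern: str       # regular expression the sentence must match
    importance: int    # hypothetical scale: larger means more important


NUMERIC = r"\d+(\.\d+)?"

# Illustrative values only: the source gives the ordering, not the numbers.
RULES = {
    "introduction": [ExtractionRule(r"This paper describes", 3)],
    "opinion": [ExtractionRule(r"[Ww]e conclude", 5)],
    "results": [ExtractionRule(NUMERIC, 4)],
    "other": [ExtractionRule(NUMERIC, 1)],
}


def score(paragraph_type, sentence):
    """Return the highest importance among matching rules, or 0 if none match."""
    rules = RULES.get(paragraph_type, RULES["other"])
    matches = [r.importance for r in rules if re.search(r.pattern, sentence)]
    return max(matches, default=0)
```

The same numeric pattern appears twice with different importance values, which is how the same surface expression can count for more in one paragraph type than in another.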

For the same reason, each rule can be given a sentence generation rule to use when the summary is generated. For example, if the "opinion description rule group" contains a rule that extracts the sentences "We concluded that α. The reason is β.", the summary can express this concisely as "Because β, α.".
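A paired extraction and generation rule of this kind might look like the following. The English wording of the pattern ("We concluded that ... This is because ...") is an assumed rendering of the Japanese example, and `EXTRACT` and `generate` are hypothetical names.

```python
import re

# Assumed English rendering of the rule in the text:
# "We concluded that A. This is because B." -> "Because B, A."
EXTRACT = re.compile(
    r"We concluded that (?P<alpha>.+?)\. This is because (?P<beta>.+?)\."
)


def generate(text):
    """Return the condensed sentence, or None if the extraction rule does not match."""
    m = EXTRACT.search(text)
    if m is None:
        return None
    return f"Because {m.group('beta')}, {m.group('alpha')}."
```

Because the generation rule reuses the groups captured by its own extraction rule, it can be far more specific than a generic rewriting rule applied to arbitrary sentences.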

[Example]

The present invention will now be described with reference to the drawings.

FIG. 1 is a flowchart showing a first embodiment of the present invention. First, in paragraph cutout 11, one unprocessed paragraph is cut out of the document. Next, in paragraph type determination 12, the type of the paragraph is determined from the wording of its heading. Then, in rule group selection 13, the appropriate important sentence extraction rule group is selected according to the paragraph type.

In extraction process 14, important sentences are extracted from the paragraph using two rule groups: the important sentence extraction rule group that depends on the selected paragraph type, and the important sentence extraction rule group applicable to every type of paragraph. If the former group is also allowed to contain rules that suppress the application of rules in the latter group, the extraction scheme becomes highly flexible. In extracted data storage 15, the extracted sentences are stored together with the paragraph number and heading.
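The interaction between the two rule groups in extraction process 14, including suppression, can be sketched as below. Representing suppression as a set of general rule names is an assumption, as are the rule names, patterns, and the `RuleGroup` class itself.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RuleGroup:
    patterns: dict                                 # rule name -> regex string
    suppresses: set = field(default_factory=set)   # general rule names to disable


# Illustrative rules; names and patterns are assumptions.
GENERAL = RuleGroup({"numeric": r"\d", "claim": r"\bwe (propose|show)\b"})
REFERENCES = RuleGroup({}, suppresses={"numeric", "claim"})
RESULTS = RuleGroup({"measured": r"\bwas measured\b"})


def extract(sentences, specific, general=GENERAL):
    """Apply the type-specific rules plus every general rule not suppressed."""
    active = dict(specific.patterns)
    active.update({name: p for name, p in general.patterns.items()
                   if name not in specific.suppresses})
    return [s for s in sentences
            if any(re.search(p, s, re.IGNORECASE) for p in active.values())]
```

Here the hypothetical `REFERENCES` group has no rules of its own and switches off both general rules, so nothing is extracted from a reference list even though its entries contain digits.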

The above processing is repeated for all paragraphs in the document (step 16). Finally, in display 17, three items are displayed in order of paragraph number: the paragraph number, the paragraph heading, and the extracted sentences.
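The flow of FIG. 1 (steps 11 to 17) can be put together in a minimal sketch. It assumes a plain-text document in which each paragraph starts with a heading line such as "1. Introduction"; that format, the sentence splitter, the output layout, and all function names are assumptions, and the rule lookup is passed in as a hypothetical `rules_for` callable.

```python
import re

HEADING = re.compile(r"^(\d+)\.\s+(.+)$")   # assumed heading format: "1. Introduction"


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def segment(text):
    """Step 11: yield (number, heading, sentences) for each headed paragraph."""
    number, heading, body = None, None, []
    for line in text.splitlines():
        m = HEADING.match(line.strip())
        if m:
            if heading is not None:
                yield number, heading, split_sentences(" ".join(body))
            number, heading, body = int(m.group(1)), m.group(2), []
        elif line.strip():
            body.append(line.strip())
    if heading is not None:
        yield number, heading, split_sentences(" ".join(body))


def summarize(text, rules_for, general_rules):
    """Steps 12-17: select rules, extract, store, then format in paragraph order."""
    store = []
    for number, heading, sentences in segment(text):
        rules = rules_for(heading) + general_rules              # steps 12-13
        hits = [s for s in sentences
                if any(r.search(s) for r in rules)]             # step 14
        store.append((number, heading, hits))                   # step 15
    store.sort(key=lambda item: item[0])                        # step 17
    return "\n".join(f"{n}. {h}: {' '.join(hits)}" for n, h, hits in store)
```

Keeping the heading next to the extracted sentences in the output is what lets a reader follow the flow of the document's subject, not just a list of isolated sentences.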

FIG. 2 is a flowchart showing a second embodiment of the present invention. In the figure, the processing from paragraph cutout 21 to determination process 26 is the same as in the first embodiment (see FIG. 1), except for extracted data storage 25.

In extracted data storage 25, the extracted sentence and the summary sentence generation rule number are stored.

FIG. 4 shows the structure of the important sentence extraction rule group, and FIG. 5 shows the structure of the summary sentence generation rule group. Important sentence extraction rule group 40 includes the items sentence extraction rule content 41 and summary sentence generation rule number 42, the latter pointing to the rule used when an extracted sentence is converted into a summary sentence. Summary sentence generation rule group 50 consists of two items: rule number 51, which is unique within the system, and summary sentence generation rule content 52. The summary sentence generation rule number stored in extracted data storage 25 is the one held in summary sentence generation rule number 42.
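The two tables of FIGS. 4 and 5 map naturally onto small record types. In the sketch below the field comments follow the reference numerals in the text, while the class names, the use of strings for rule contents, and the duplicate check are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractionRule:
    """One entry of an important sentence extraction rule group (40)."""
    content: str                       # sentence extraction rule content (41)
    generation_rule_number: int        # summary sentence generation rule number (42)
    importance: Optional[int] = None   # importance (43), used by the third embodiment


@dataclass(frozen=True)
class GenerationRule:
    """One entry of the summary sentence generation rule group (50)."""
    number: int     # rule number (51), unique within the system
    content: str    # summary sentence generation rule content (52)


def build_generation_table(rules):
    """Index generation rules by number, rejecting duplicate numbers."""
    table = {}
    for rule in rules:
        if rule.number in table:
            raise ValueError(f"duplicate generation rule number: {rule.number}")
        table[rule.number] = rule
    return table
```

Storing only the number in each extraction rule means several extraction rules can share one generation rule.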

Next, in extracted sentence retrieval 27, the unprocessed extracted sentence that appears earliest in the document is selected. The summary sentence generation rule corresponding to that sentence is looked up in summary sentence generation rule group 50 by its summary sentence generation rule number, and the summary sentence is generated in summary sentence generation 28. The generated sentence is displayed in display 29. The processing is repeated until no unprocessed extracted sentence remains (step 20).
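The loop of the second embodiment (steps 27 to 29, repeated under step 20) can be sketched as follows. The record layout `(position, sentence, rule_number)`, the encoding of a generation rule as a regex and replacement template, and the sample rules are all assumptions.

```python
import re

# Hypothetical generation rule table: number -> (pattern, replacement template).
GENERATION_RULES = {
    1: (r"^This paper describes (.+)\.$", r"Topic: \1."),
    2: (r"^(.+)$", r"\1"),   # identity rule: output the sentence unchanged
}


def generate_summary(extracted):
    """Process stored (position, sentence, rule_number) records in document order."""
    pending = list(extracted)
    output = []
    while pending:                                       # step 20: until none remain
        record = min(pending, key=lambda r: r[0])        # step 27: earliest sentence
        pending.remove(record)
        _, sentence, number = record
        pattern, template = GENERATION_RULES[number]     # look up by rule number
        output.append(re.sub(pattern, template, sentence))   # step 28
    return output                                        # step 29 would display these
```

Selecting the earliest remaining sentence each time keeps the summary in the order of the original document regardless of the order in which sentences were stored.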

FIG. 3 is a flowchart showing a third embodiment of the present invention. In the figure, the processing from paragraph cutout 31 to determination process 36 is the same as in the first embodiment (see FIG. 1), except for extracted data storage 35.

In extracted data storage 35, two items are stored: the extracted sentence and its importance. Important sentence extraction rule group 40 in FIG. 4 includes the item importance 43, which indicates the importance of the extracted sentence. The importance stored in extracted data storage 35 is the value held in importance 43.

Next, in extracted sentence retrieval 37, the unprocessed extracted sentence that appears earliest in the document is selected. Extracted sentence determination 38 checks whether the sentence exceeds a predetermined importance. If it does, the sentence is output in display 39; if not, processing moves on to the next extracted sentence. The processing is repeated until no unprocessed extracted sentence remains (step 30).
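The selection step of the third embodiment amounts to a threshold filter applied in document order. In this sketch the record layout `(position, sentence, importance)` and the function name are assumptions; "exceeds" is read as a strict comparison.

```python
def select_by_importance(extracted, threshold):
    """Return sentences whose importance exceeds threshold, in document order.

    `extracted` holds (position, sentence, importance) records.
    """
    selected = []
    for _, sentence, importance in sorted(extracted, key=lambda r: r[0]):  # step 37
        if importance > threshold:                                        # step 38
            selected.append(sentence)                                     # step 39
    return selected
```

Raising the threshold shortens the summary without touching the rules, which is how the degree of summarization can be adjusted.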

FIG. 6 is an explanatory diagram showing the configuration of the above document summarization method. In the figure, paragraph type determination program 61 determines the type of each paragraph, and extraction processing program 62 extracts important sentences with reference to important sentence extraction rule base 67. Important sentence extraction rule base 67 stores the important sentence extraction rule group that does not depend on paragraph type and the important sentence extraction rule groups corresponding to the various paragraph types.

Extracted sentence selection program 63 selects the extracted sentences whose importance exceeds a predetermined boundary. Summary sentence generation program 64 generates summary sentences with reference to summary sentence generation rule base 68, which stores the summary sentence generation rule group.

Display program 65 displays extracted sentences and summary sentences on CRT 66. Overall control section 60 controls each of the above programs. Working memory 69 is used for extracted data storage 15, 25, and 35 in FIGS. 1, 2, and 3.

[Effect of the invention]

The important sentence extraction process of the present invention extracts sentences that are important in a sense that depends on the type of each paragraph and presents them alongside the paragraph headings, so the reader can easily grasp both the content of the whole document and the flow of its subject.

In the summary sentence generation process, summary sentence generation rules are associated with the important sentence extraction rules, so summary generation can be specified in more detail and its automation is advanced. In addition, because the importance of extracted sentences can now be defined for each important sentence extraction rule, a means is provided for narrowing down the extracted sentences according to the desired level of summarization.

The document summarization method of the present invention targets relatively long documents that are structured into paragraphs, not short documents that have no paragraphs. Summarizing a document without paragraphs corresponds to summarizing with only the important sentence extraction rules that do not depend on paragraph type. In other words, this method is an extension of a summarization method for documents without paragraphs.

[Brief explanation of drawings]

FIG. 1 is a flowchart showing the first embodiment of the present invention, FIG. 2 is a flowchart showing the second embodiment, FIG. 3 is a flowchart showing the third embodiment, FIG. 4 is an explanatory diagram showing the structure of the important sentence extraction rule group, FIG. 5 is an explanatory diagram showing the structure of the summary sentence generation rule group, and FIG. 6 is an explanatory diagram showing a configuration example.

40: important sentence extraction rule group
50: summary sentence generation rule group
60: overall control program
61: paragraph type determination program
62: extraction processing program
63: extracted sentence selection program
64: summary sentence generation program
65: display program

Claims

1. A document summarization method that uses a computer to generate a summary of a document, characterized in that a given document is divided into headed paragraphs; an important sentence extraction rule group corresponding to the type of paragraph is selected for each paragraph; the important sentences of each paragraph are extracted by referring both to that rule group and to an important sentence extraction rule group that does not depend on paragraph type; and the set of pairs of the important sentences extracted from each paragraph and the heading of that paragraph is determined to be the important part of the whole given document and is output.

2. The document summarization method according to claim 1, characterized in that the important sentences are transformed, concatenated, and output by applying the summary sentence generation rule associated in advance with each important sentence extraction rule.

3. The document summarization method according to claim 1 or 2, characterized in that a summary is generated by selecting only those important sentences that exceed a predefined boundary value, with reference to the importance set in advance for each important sentence extraction rule.