JPH0776971B2

JPH0776971B2 - Document abstract creation device

Info

Publication number: JPH0776971B2
Application number: JP1063472A
Authority: JP
Inventors: 孝日比
Original assignee: 工業技術院長
Priority date: 1989-03-17
Filing date: 1989-03-17
Publication date: 1995-08-16
Anticipated expiration: 2010-08-16
Also published as: JPH02289060A

Description

[Detailed Description of the Invention] (Field of Industrial Application) The present invention relates to a document abstract creation device that automatically creates an abstract from a machine-readable document.

(Prior Art) Conventional techniques for creating document abstracts fall into two groups: methods that extract keywords to make retrieval easier, and methods that extract important parts using surface-level information in the text.

The former, keyword-based method divides the document into words, records each word's frequency and position of occurrence, applies an evaluation function to each word to obtain an evaluation value, and takes the highest-rated words as the keywords of the document. Its main purpose is to make it easier to find a desired document in a large collection. Within this keyword approach, a method has also been proposed that evaluates the importance of a sentence from its length and the number of keywords it contains (reference: "The Automatic Creation of Literature Abstracts", IBM Journal, April 1958).

For the latter approach, which extracts important parts using surface-level information in the document, several methods have been proposed. Examples include methods that focus on the main verbs of sentences, analyze the relationships among them to determine the structure of the document, and then extract its main parts, and methods that focus on connective relationships.

(Problems to Be Solved by the Invention) Because these conventional methods evaluate importance either word by word or sentence by sentence, they cannot extract only the important part within a single sentence.

The keyword method helps with document retrieval, but it does not examine the content. It therefore does not reveal which parts are particularly important, and the reader still has to look through the document itself. In addition, because the evaluation method it uses operates on words as its unit, it cannot be used to determine which sentences are the key sentences.

Methods that extract important parts using surface-level information do not share the problems of the keyword method, but when only surface-level information is used, parts that are not very important are often extracted. For example, passages containing enumerations end up being extracted. Moreover, because the method of evaluating importance is fixed, evaluation from a different viewpoint is not possible.

An object of the present invention is to provide a document abstract creation device that can evaluate importance more precisely than conventional methods and can evaluate importance according to the request at hand.

(Means for Solving the Problems) To achieve this object, the document abstract creation device of the present invention comprises: an input device for reading the sentences of an input document; a context analysis device that analyzes the context structure of the read document and outputs, as the analysis result of the document structure, information including at least the connection relations between sentences and their directions and information on anaphora and ellipsis appearing in the document; a database device that stores a database of rules for evaluating the importance of sentences and words, the rules including vocabulary-level rules for calculating evaluation values of words and sentences from the vocabulary contained in a sentence and document-structure-level rules for calculating the evaluation value of a sentence from the analysis result of the context structure; and an evaluation device that performs vocabulary-level evaluation using the rules and evaluates the importance of sentences and words from the analysis result of the document structure output by the context analysis device.

(Operation) Regarding the first problem above, because the targets of evaluation are both words and sentences rather than words alone, the important sentences alone can be picked out. To make up for the shortcomings of methods that use only surface information, the structure of the document is analyzed and the result is used to extract the important parts. This allows the parts of the document that the author wants to convey to be extracted more accurately. Regarding the third problem, the rules used for evaluation can be changed, so an evaluation can place particular emphasis on a chosen point of interest. For example, the evaluation can emphasize the structure of the document or emphasize certain keywords.

(Embodiment) An embodiment of the document abstract creation device of the present invention is described below with reference to the drawing.

FIG. 1 is a block diagram outlining the present invention. Reference numeral 10 denotes an input document; 12, an input device for reading the sentences of the input document 10; 14, a context analysis device that performs context analysis of the read document; 16, an evaluation device that evaluates the importance of the sentences and words of the read document; 18, a generation device that generates an abstract of the document; and 20, a database device that stores the database of rules used by the evaluation device 16 to evaluate importance.

First, the input device 12 reads the input document 10. At this point, morphological analysis is performed by consulting a dictionary: the document is divided into words, part-of-speech information and the like are added, and the result is temporarily stored in the storage device 12a. Next, the context analysis device 14 receives the output of the input device 12, analyzes the document structure, and stores the result in a suitable storage device 14a provided within the context analysis device 14. This analysis result includes information on "the connection relations between sentences and their directions" and information on "anaphora and ellipsis appearing in the document".

Next, the evaluation device 16 evaluates the importance of sentences and words.

To evaluate the importance of sentences and words, this device reads out and uses the morphological analysis result output by the input device 12 and the syntactic analysis and context analysis results temporarily stored in the storage device 14a. It also reads the database of evaluation rules from the database device 20 shown in FIG. 1. This database supplies the rules that assign scores to sentences or words, together with a weight for each rule. Rules can therefore be added and weights changed easily.

The rule database used here consists of two parts.

The first part of the rule database uses the morphological analysis result output by the input device 12 and evaluates importance from vocabulary-level information. It holds importance evaluation rules based on word and phrase patterns, and importance is calculated by the rules that use these patterns.

Examples of these rules are as follows. The following are given as keywords.

High-frequency words: The frequency of a word in the document is divided by the number of words in the document, and words for which this ratio is at or above a certain value are treated as high-frequency words. Function words such as particles and auxiliary verbs are not included in the word count.

Emphatic expressions: Emphasis words and words enclosed in brackets are treated as emphatic expressions.

Important words: Words that explicitly indicate that the passage is important (e.g., "after all", "in short").

Field-specific important words: For each field and type of document, a set of words that indicate importance is given.

Each of these words is given a weight. The score of a sentence is the sum of the scores of the keywords it contains.
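The vocabulary-level rules above can be sketched as follows. This is a minimal illustration, not the implementation of the device: the function names, the English function-word list, the threshold, and the weight values are all assumptions, since the description gives no concrete values. It also assumes that every occurrence of a keyword contributes its weight.

```python
from collections import Counter

# Hypothetical function-word list; the description only says that
# function words (particles, auxiliary verbs) are excluded from the count.
FUNCTION_WORDS = {"the", "a", "of", "is", "to", "in"}


def high_frequency_words(sentences, threshold):
    """Return content words whose relative frequency is >= threshold.

    `sentences` is a list of tokenized sentences (lists of words).
    """
    content = [w for s in sentences for w in s if w not in FUNCTION_WORDS]
    counts = Counter(content)
    total = len(content)
    if total == 0:
        return set()
    return {w for w, c in counts.items() if c / total >= threshold}


def lexical_score(sentence, keyword_weights):
    """Sum the weights of the keywords occurring in one tokenized sentence."""
    return sum(keyword_weights.get(w, 0.0) for w in sentence)


sentences = [
    ["the", "abstract", "is", "short"],
    ["in", "short", "the", "abstract", "summarizes", "the", "document"],
]
frequent = high_frequency_words(sentences, threshold=0.3)

# Assumed weights: 1.0 for each high-frequency word, 2.0 for one
# hand-picked important word.
weights = {w: 1.0 for w in frequent}
weights["summarizes"] = 2.0

scores = [lexical_score(s, weights) for s in sentences]
```

Keeping the weights in a plain dictionary mirrors the point that weights can be changed without touching the scoring logic.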

The second part of the rule database uses the information on the analyzed document structure. This information includes the dependency relations between sentences and the anaphoric relations.

Examples of the evaluation criteria are given below.

Number of sentences depending on the sentence: The dependency relations between sentences are examined, and the number of sentences that depend on the sentence, directly or indirectly, is counted and used as the score. Here, "sentence 1 depends on sentence 2" means that there is a dependency relation between the two sentences and that its direction is from 1 to 2.

Number of times a word is referred to anaphorically: The number of times a word is later referred to by anaphora is counted and used as the score of that word. For a sentence, the scores of the words in it are summed to give the score of the sentence.

Embedded sentence check: Embedded sentences are given a negative score.

Evaluation by sentence type: A function that depends on the sentence type and the document type is defined, and the score is determined by it. For example, in an editorial, passages that state an opinion are given a high score.
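Two of the structure-level criteria above, the count of dependent sentences and the anaphora count, can be sketched as follows. This is an illustration under assumptions: the dictionary representation of the dependency relations, the function names, and the sample data are hypothetical, and the description does not say how the context analysis device encodes its result.

```python
def count_dependents(depends_on, target):
    """Count sentences that depend on `target` directly or indirectly.

    `depends_on` maps a sentence id to the ids of the sentences it
    depends on, so an entry {1: [2]} means "sentence 1 depends on
    sentence 2" (direction from 1 to 2).
    """
    # Invert the edges: for each sentence, who depends on it directly.
    dependents = {}
    for src, dsts in depends_on.items():
        for dst in dsts:
            dependents.setdefault(dst, set()).add(src)

    # Depth-first walk over the inverted edges to collect indirect dependents.
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for d in dependents.get(node, ()):
            if d not in seen and d != target:
                seen.add(d)
                stack.append(d)
    return len(seen)


def anaphora_sentence_score(sentence, anaphora_counts):
    """Sum, over the words of a sentence, how often each is later referred to."""
    return sum(anaphora_counts.get(w, 0) for w in sentence)


# Sentences 1 and 4 depend on 2 directly; sentence 3 depends on 2 via 1.
depends_on = {1: [2], 3: [1], 4: [2]}
dependents_of_2 = count_dependents(depends_on, 2)

# Assumed anaphora counts per word, as a context analyzer might report them.
anaphora_counts = {"device": 2, "document": 1}
anaphora_score = anaphora_sentence_score(
    ["the", "device", "reads", "document"], anaphora_counts
)
```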

The evaluation in the evaluation device 16 is carried out independently for each rule. Because each rule is given a weight, the score is the sum of the products of each evaluation value and its weight. Expressed as a formula:

$$E = \sum_{i} W_i E_i$$

Here, $E_i$ is the evaluation value for the i-th rule, $W_i$ is the weight for the i-th rule, and $E$ is the final evaluation value.
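The weighted sum described above can be sketched in a few lines. The function name and the sample values are assumptions chosen for illustration, and the negative value stands in for a rule such as the embedded-sentence check.

```python
def final_score(evaluations, weights):
    """Combine per-rule evaluation values E_i with rule weights W_i.

    Returns the sum over all rules of W_i * E_i.
    """
    if len(evaluations) != len(weights):
        raise ValueError("each rule needs exactly one weight")
    return sum(w * e for w, e in zip(weights, evaluations))


# Three hypothetical rules: two positive evaluations and one penalty.
score = final_score([2.0, 3.0, -1.0], [1.0, 0.5, 2.0])
```

Because each rule is evaluated independently, changing the emphasis of the evaluation only requires a different `weights` list.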

Once the evaluation value has been obtained for each sentence, the results are passed to the generation device 18. The generation device 18 extracts the high-scoring parts and outputs them as the abstract of the document.
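The extraction step can be sketched as follows. The description does not say how "high-scoring" is decided, so this sketch assumes a fixed number `k` of sentences is kept and that they are output in their original order; both choices, and the function name, are assumptions.

```python
def make_abstract(sentences, scores, k):
    """Keep the k highest-scoring sentences, in original document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])  # restore document order
    return [sentences[i] for i in chosen]


abstract = make_abstract(["A.", "B.", "C.", "D."], [1.0, 4.0, 0.5, 3.0], k=2)
```

A score threshold would be an equally plausible selection rule; the top-k form is used here only because it keeps the abstract length predictable.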

The present invention is not limited to the embodiment described above. For example, the storage device 14a may be provided independently of the context analysis device 14. Likewise, although the storage device 12a that temporarily stores the morphological analysis result is provided in the input device 12, it may be provided independently of the input device 12. The database device 20 may also be provided within the evaluation device 16.

(Effects of the Invention) As is clear from the above description, the document abstract creation device of the present invention automatically extracts the important parts from text, which makes it easier to browse documents and to search for needed literature.

[Brief description of drawings]

FIG. 1 is a block diagram showing an example of the document abstract creation device of the present invention. 10: input document; 12: input device; 12a: storage device; 14: context analysis device; 14a: storage device; 16: evaluation device; 18: generation device; 20: database device.

Claims

1. A document abstract creation device comprising: an input device for reading the sentences of an input document; a context analysis device that analyzes the context structure of the read document and outputs, as the analysis result of the document structure, information including at least the connection relations between sentences and their directions and information on anaphora and ellipsis appearing in the document; a database device that stores a database of rules for evaluating the importance of sentences and words, the rules including vocabulary-level rules for calculating evaluation values of words and sentences from the vocabulary contained in a sentence and document-structure-level rules for calculating the evaluation value of a sentence from the analysis result of the context structure; and an evaluation device that performs vocabulary-level evaluation using the rules and evaluates the importance of sentences and words from the analysis result of the document structure output by the context analysis device.