JPH11338867A - Document summarizing method and device and storage medium storing document summarizing program

Info

Publication number: JPH11338867A
Application number: JP10142861A
Authority: JP
Inventors: Toshinori Izudera; 俊哲巖寺
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-05-25
Filing date: 1998-05-25
Publication date: 1999-12-10

Abstract

PROBLEM TO BE SOLVED: To summarize a document, even one containing new information, without preparing heuristics in advance, by extracting from the target document the part that best characterizes it on the basis of feature information and outputting the extracted part as a summary characterizing the target document. SOLUTION: In a document summarizing method that extracts information characterizing a target document read from an input storage means storing document data and summarizes the document, the target document to be summarized is compared with a standard document set, i.e., the entire set of documents that can be processed, and feature information characterizing the target document is calculated from the target document (S1). The part that best characterizes the target document is extracted from the target document on the basis of the feature information (S2). The extracted part is then output as a summary characterizing the target document (S3). The target document can thus be summarized properly even when its contents differ from those assumed when the system was first put into use, or when it contains new information.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a document summarizing method and apparatus and to a storage medium storing a document summarizing program. In particular, it relates to a document summarizing method and apparatus, and a storage medium storing a document summarizing program, that are used in document information processing and summarize a document by extracting from it the portions that characterize its content.

[0002]

2. Description of the Related Art In recent years, the Internet has spread rapidly, and data recording devices have grown in capacity and fallen in price. As a result, a large amount of diverse information has become readily available over networks. With the spread of the WWW, many users also generate and use information among themselves. However, as the amount of available information has increased dramatically, a situation often called an information flood, it has become difficult to find useful information and to decide what to keep and what to discard.

[0003] It is difficult to browse all of this information, search it for useful items, and sort them out. To use appropriate information efficiently, it is therefore necessary to extract characteristic information from the large volume available and make the necessary and sufficient information selectively usable. Information retrieval technology is currently used as a means of using information selectively. However, the information available over networks is enormous in volume and wide-ranging in content, so a single search returns many documents containing similar information. In many cases, a summary characterizing each document is attached to each retrieved document and presented to the user, to help the user select from the search results the documents that contain the appropriate information.

[0004] Attaching a summary of appropriate length and quality to each of an enormous number of documents is difficult to do by hand. Moreover, a manually written summary is influenced by the subjectivity and knowledge of the person writing it, so when several people write the summaries, their quality cannot be kept uniform. Using information selectively therefore requires a document summarization technique that automatically creates and attaches a summary appropriate to the content of each document.

[0005] A representative conventional document summarization technique is summarization using heuristics. This technique judges the importance of each sentence by combining the position of the sentence in the document, the words that appear in titles and headings, and the presence or absence of clue phrases, then extracts the sentences judged important and combines them to create a summary.
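The conventional heuristic approach described above can be sketched roughly as follows. This is only an illustration of the general idea, not a method taken from any cited system: the clue phrases, the weights, and the names `heuristic_score` and `heuristic_summary` are all hypothetical.

```python
import re

# Hypothetical clue phrases; a real system would hand-tune these per document type.
CLUE_PHRASES = ("in conclusion", "in summary", "we propose")


def heuristic_score(sentence, index, total, title_words):
    """Combine position, title-word overlap, and clue phrases into one score."""
    lowered = sentence.lower()
    words = set(re.findall(r"\w+", lowered))
    position = 1.0 - index / total            # earlier sentences score higher
    title_overlap = len(words & title_words)  # words shared with the title
    clue = sum(1 for phrase in CLUE_PHRASES if phrase in lowered)
    return position + title_overlap + 2.0 * clue  # weights are illustrative only


def heuristic_summary(title, sentences, k=1):
    """Return the k highest-scoring sentences, kept in their original order."""
    title_words = set(re.findall(r"\w+", title.lower()))
    scored = [
        (heuristic_score(s, i, len(sentences), title_words), i)
        for i, s in enumerate(sentences)
    ]
    top = sorted(scored, reverse=True)[:k]
    return [sentences[i] for _, i in sorted(top, key=lambda pair: pair[1])]
```

Every ingredient of the score (the phrase list, the position rule, the weights) has to be chosen by hand before any document is seen.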

[0006] This technique requires heuristics to be prepared in advance, which makes it unsuitable for summarizing documents in which new information appears. In addition, heuristics that work for one document type do not necessarily work for another. For example, position information is effective for newspaper articles, whereas heuristics based on clue phrases are effective for academic papers.

[0007]

[Problems to Be Solved by the Invention] However, documents of many different types are mixed together on the Internet, so the document type must be determined automatically. At present, there is no effective technique for determining the document type. Furthermore, many documents have no title or headings, and many contain almost no clue phrases, so heuristics often fail to work effectively.

[0008] The present invention has been made in view of the above points, and its object is to provide a document summarizing method and apparatus, and a storage medium storing a document summarizing program, that can summarize even documents in which new information appears, without requiring heuristics or the like to be prepared in advance.

[0009]

[Means for Solving the Problems] FIG. 1 is a diagram for explaining the principle of the present invention. The present invention (claim 1) is a document summarizing method that extracts information characterizing a target document read from an input storage means storing document data and summarizes the document, wherein the target document to be summarized is compared with a standard document set, which is the entire set of documents that can be processed, and feature information characterizing the target document is calculated from the target document (step 1); the part that best characterizes the target document is extracted from the target document on the basis of the feature information (step 2); and the extracted part is output as a summary characterizing the target document (step 3).

[0010] FIG. 2 is a diagram showing the principle configuration of the present invention. The present invention (claim 2) is a document summarizing apparatus that extracts information characterizing a target document read from an input storage means 20 storing document data and summarizes the document, comprising: feature information calculating means 40 for comparing the target document to be summarized with a standard document set, which is the entire set of documents that can be processed, and calculating from the target document feature information characterizing it; characteristic part extracting means 50 for extracting from the target document, on the basis of the feature information, the part that best characterizes it; and summary creating means 60 for outputting the part extracted by the characteristic part extracting means 50 as a summary characterizing the target document.

[0011] The present invention (claim 3) comprises: standard document set updating means for acquiring a standard document set from the input storage means; standard document set analysis means for analyzing each document in the standard document set given to the standard document set updating means and calculating the words constituting the documents and the appearance frequency of each word; standard document set analysis result storage means for storing the words in the standard document set in association with their appearance frequencies; target document input means for receiving a target document from the input storage means; target document analysis means for analyzing the target document and calculating the words constituting it and the appearance frequency of each word; target document analysis result storage means for storing the words in the target document in association with their appearance frequencies; target document feature calculation means for calculating feature information of the target document using the standard document set analysis result stored in the standard document set analysis result storage means and the target document analysis result stored in the target document analysis result storage means; target document feature storage means for storing the feature information of the target document calculated by the target document feature calculation means; feature expression extraction means for extracting a feature expression from the target document using the feature information of the target document stored in the feature information storage means; feature expression storage means for storing the feature expression extracted from the target document by the feature expression extraction means; and feature expression output means for giving the feature expression stored in the feature expression storage means to a transfer medium.

[0012] In the present invention (claim 4), the feature information calculating means includes means that uses a chi-square (χ²) test to calculate the feature information. In the present invention (claim 5), the target document feature calculation means uses a word, a character, or a character string of fixed length as the unit for calculating the feature information score obtained by referring to the standard document set analysis result and the target document analysis result.
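The three scoring units named above (word, character, fixed-length character string) can be illustrated with a small counting helper. This is a sketch under stated assumptions: the function name `count_units`, its parameters, and the use of whitespace splitting for the word unit are hypothetical and stand in for whatever analysis an actual implementation uses.

```python
from collections import Counter


def count_units(text, unit="word", n=2):
    """Count occurrences of the chosen scoring unit in text.

    unit: "word"  -> whitespace-separated tokens (assumed tokenizer),
          "char"  -> single non-space characters,
          "ngram" -> every substring of fixed length n.
    """
    if unit == "word":
        items = text.split()
    elif unit == "char":
        items = [c for c in text if not c.isspace()]
    elif unit == "ngram":
        items = [text[i:i + n] for i in range(len(text) - n + 1)]
    else:
        raise ValueError(f"unknown unit: {unit}")
    return Counter(items)
```

The character and fixed-length-string units need no word segmentation at all, which may be convenient for text written without spaces between words.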

[0013] The present invention (claim 6) is a storage medium storing a document summarizing program that extracts information characterizing a target document read from an input storage means storing document data and summarizes the document, the program comprising: a feature information calculation process of comparing the target document to be summarized with a standard document set, which is the entire set of documents that can be processed, and calculating from the target document feature information characterizing it; a characteristic part extraction process of extracting from the target document, on the basis of the feature information, the part that best characterizes it; and a summary creation process of outputting the part extracted in the characteristic part extraction process as a summary characterizing the target document.

[0014] The present invention (claim 7) comprises: a standard document set update process of acquiring a standard document set from the input storage means; a standard document set analysis process of analyzing each document in the standard document set given to the standard document set update process, calculating the words constituting the documents and the appearance frequency of each word, and storing the words in the standard document set in association with their appearance frequencies in the standard document set analysis result storage means; a target document input process of receiving a target document from the input storage means; a target document analysis process of analyzing the target document, calculating the words constituting it and the appearance frequency of each word, and storing the words in the target document in association with their appearance frequencies in the target document analysis result storage means; a target document feature calculation process of calculating feature information of the target document using the standard document set analysis result stored in the standard document set analysis result storage means and the target document analysis result stored in the target document analysis result storage means, and storing the calculated feature information in the target document feature storage means; a feature expression extraction process of extracting a feature expression from the target document using the feature information of the target document stored in the feature information storage means, and storing the feature expression in the feature expression storage means; and a feature expression output process of giving the feature expression stored in the feature expression storage means to a transfer medium.

[0015] As described above, the present invention compares the target document with the standard document set, calculates from the target document the information that characterizes it, extracts from the target document the part that best characterizes it on the basis of that feature information, and outputs the extracted part as a summary characterizing the document. This makes it possible to attach to each of a large number of documents, in a timely manner, a summary of appropriate quality and length that characterizes it. Moreover, no knowledge for document summarization needs to be prepared in advance, so the invention can be applied to summarizing a wide variety of documents.

[0016]

[Embodiments of the Invention] FIG. 3 shows the configuration of the document summarizing apparatus of the present invention. In the apparatus shown in the figure, the following are connected to a monitoring control unit 10: an input storage device 20, a transfer medium 30, a standard document set update unit 101, a standard document set analysis unit 102, a target document input unit 103, a target document analysis unit 104, a target document feature calculation unit 105, a feature expression extraction unit 106, a feature expression output unit 107, a standard document set analysis result storage unit 201, a target document analysis result storage unit 202, a target document feature storage unit 203, and a feature expression storage unit 204.

[0017] Each of the processing units 101 to 107 is implemented, for example, by a digital computer and comprises a CPU, a ROM that stores an operating program and the data needed to execute it, and a RAM used as working memory. All of the processing units may instead be implemented on a single digital computer. Each of the storage units 201 to 204 is held in a memory such as a hard disk.

[0018] The input storage unit 20 stores, in a fixed order, the standard document set and the target documents given to the apparatus. The input storage unit 20 can be implemented by a semiconductor memory device, a hard disk, or a floppy disk. The transfer medium 30 is a communication channel or recording medium to which the processing results of the apparatus are given. In the following description, the "standard document set" refers either to the entire set of documents that the apparatus may process or to a standard document set whose population is that document set. The "target document" is the document to be summarized. A "feature expression" is an expression extracted from the target document that characterizes that target document.

[0019] The standard document set update unit 101 receives the standard document set from the input storage device 20. The standard document set analysis unit 102 analyzes each document in the standard document set given to the standard document set update unit 101 and calculates the words constituting the documents and the appearance frequency of each word.

[0020] The standard document set analysis result storage unit 201 stores the words in the standard documents obtained by the standard document set analysis unit 102 in association with their appearance frequencies. The target document input unit 103 receives the target document to be summarized from the input storage device 20 and passes it to the target document analysis unit 104. The target document analysis unit 104 analyzes the target document passed from the target document input unit 103 and calculates the words constituting the document and the appearance frequency of each word.

[0021] The target document analysis result storage unit 202 stores the appearance frequencies of the words in the target document, obtained by the target document analysis unit 104, in association with that document. The target document feature calculation unit 105 calculates the feature information of the target document using the standard document set analysis result stored in the standard document set analysis result storage unit 201 and the target document analysis result stored in the target document analysis result storage unit 202.

[0022] The target document feature storage unit 203 records the feature information of the target document calculated by the target document feature calculation unit 105. The feature expression extraction unit 106 extracts a feature expression from the target document using the feature information stored in the target document feature storage unit 203. The feature expression storage unit 204 stores the feature expression extracted from the target document by the feature expression extraction unit 106.

[0023] The feature expression output unit 107 gives the feature expression stored in the feature expression storage unit 204 to the transfer medium 30.

[0024]

[Examples] Embodiments of the present invention will be described below with reference to the drawings. First, the standard document set analysis result storage unit 201, the target document analysis result storage unit 202, the target document feature storage unit 203, and the feature expression storage unit 204, which are connected to the monitoring control unit 10 shown in FIG. 3, will be described.

[0025] The standard document set analysis result storage unit 201 stores and holds the analysis result of the standard document set, which is the processing result of the standard document set analysis unit 102. Here, the "standard document set" is the entire set of documents searched by the information retrieval apparatus to which this apparatus is connected, or a standard set of that document set. The "analysis result of the standard document set" is obtained by morphologically analyzing the text of all documents in the standard document set and associating each word expression with the appearance frequency of that word. The analysis result is represented, stored, and held as a table with two columns, word expression and appearance frequency. Each row of the table represents the correspondence between one word expression and the appearance frequency of that word. The table is structured so that the corresponding row can be looked up using a word expression as the key.
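The two-column table keyed by word expression maps naturally onto a dictionary of counts. The following is a minimal sketch, assuming a lowercase whitespace tokenizer in place of the morphological analysis the text describes; `tokenize` and `build_frequency_table` are hypothetical names.

```python
from collections import Counter


def tokenize(text):
    # Stand-in for morphological analysis (assumption): lowercase whitespace tokens.
    return text.lower().split()


def build_frequency_table(documents):
    """Return a table mapping each word expression to its total appearance count."""
    table = Counter()
    for doc in documents:
        table.update(tokenize(doc))
    return table
```

For example, `build_frequency_table(["the cat", "The dog"])["the"]` looks up the row for "the" by key; a word that never occurs simply has a count of zero.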

[0026] The target document analysis result storage unit 202 stores and holds the analysis result of the target document obtained as the processing result of the target document analysis unit 104. Here, the "target document" is the document to be summarized. The "analysis result of the target document" is obtained by morphologically analyzing the text of the target document and associating each word expression with the appearance frequency of that word in the target document. It has the same format as the analysis result of the standard document set: it is represented, stored, and held as a table with two columns, word expression and appearance frequency. Each row of the table represents the correspondence between one word expression and the appearance frequency of that word. The table is structured so that the corresponding row can be looked up using a word expression as the key.

[0027] The target document feature storage unit 203 stores and holds the information, obtained as the processing result of the target document feature calculation unit 105, that scores the features of the target document relative to the standard document set. Here, the "features of the target document" are obtained by comparing the appearance frequency distribution of words in the standard document set with that in the target document and quantifying, word by word, the size of the difference between the two distributions. A word that is characteristic of the target document, that is, a word whose appearance frequency distribution differs more between the standard document set and the target document, takes a larger value. The "features of the target document" are represented as a table with two columns: word expression and feature score, the latter being the quantified feature of that word's appearance distribution. Each row represents the correspondence between one word expression and its feature score. One table corresponds to each target document.
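One way to turn the two frequency tables into per-word feature scores is sketched below. It assumes the chi-square statistic of a 2x2 contingency table (this word versus all other words, target document versus standard document set). The text names the chi-square test but this passage does not give a formula, so the exact form, and the names `chi_square_score` and `feature_scores`, are assumptions.

```python
def chi_square_score(a, n_target, c, n_standard):
    """Chi-square statistic for one word.

    a: count of the word in the target document
    c: count of the word in the standard document set
    n_target, n_standard: total word counts of each side
    """
    b = n_target - a      # other words in the target document
    d = n_standard - c    # other words in the standard document set
    n = n_target + n_standard
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0        # degenerate table: no evidence of a difference
    return n * (a * d - b * c) ** 2 / denom


def feature_scores(target_counts, standard_counts):
    """Map each word of the target document to its feature score."""
    n_target = sum(target_counts.values())
    n_standard = sum(standard_counts.values())
    return {
        word: chi_square_score(a, n_target, standard_counts.get(word, 0), n_standard)
        for word, a in target_counts.items()
    }
```

A word whose relative frequency is the same on both sides scores zero, and the score grows as the two proportions diverge. The plain statistic does not distinguish over-representation from under-representation in the target document; an implementation might want to handle that separately.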

[0028] The feature expression storage unit 204 stores and holds the feature expression in the target document, obtained as the processing result of the feature expression extraction unit 106, in association with the document from which it was extracted. A feature expression in the target document is a run of consecutive words of a predetermined length, a predetermined number of sentences, or the word string making up a predetermined part, contained in the target document, for which the average feature score of all its constituent words is the highest.
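For the fixed-length word-string case, picking the span with the highest average feature score can be sketched as a sliding window. The function name `best_window` and the choice to treat unscored words as 0.0 are assumptions made for illustration.

```python
def best_window(words, scores, width):
    """Return (start, words[start:start + width]) with the highest mean feature score.

    With a fixed width, the highest mean is the same as the highest sum.
    """
    if width <= 0 or width > len(words):
        raise ValueError("width must be between 1 and len(words)")
    values = [scores.get(w, 0.0) for w in words]  # unscored words count as 0.0
    window_sum = sum(values[:width])
    best_sum, best_start = window_sum, 0
    for start in range(1, len(values) - width + 1):
        # Slide the window one word to the right.
        window_sum += values[start + width - 1] - values[start - 1]
        if window_sum > best_sum:
            best_sum, best_start = window_sum, start
    return best_start, words[best_start:best_start + width]
```

The sentence-based variant would instead compute the mean score of each candidate sentence (or group of sentences) and keep the one with the highest mean, since candidates then differ in length.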

[0029] Next, each processing unit will be described. The monitoring control unit 10 is a module that controls the processing units 101 to 107 and governs the data flow. FIG. 4 is a flowchart of the monitoring control process executed by the monitoring control unit in one embodiment of the present invention.

[0030] The processing will be described below with reference to the figure.

Step 101) It is determined whether the standard document set has been updated. If it has been updated, the process proceeds to step 102; if not, step 101 is repeated.

Step 102) The updated standard document set is transferred from the input storage device 20 to the standard document set update unit 101. At this point, the standard document set update unit 101 executes the standard document set update process on the transferred standard document set and outputs the resulting standard document set update result to the monitoring control unit 10.

[0031] Step 103) All standard document set update results output from the standard document set update unit 101 are transferred to the standard document set analysis unit 102. The standard document set analysis unit 102 then executes the standard document set analysis process and outputs the result to the monitoring control unit 10.

Step 104) All standard document set analysis results output from the standard document set analysis unit 102 are transferred to the standard document set analysis result storage unit 201, whose contents are updated so that the newly transferred values are stored and held.

[0032] Step 105) It is determined whether a target document has been input. If one has been input, the process proceeds to step 106; if not, step 105 is repeated.

Step 106) The input target document is transferred from the input storage device 20 to the target document input unit 103. The target document input unit 103 executes the target document input process on the input target document and outputs the result to the monitoring control unit 10.

[0033] Step 107) The target document input process result output from the target document input unit 103 is transferred to the target document analysis unit 104. The target document analysis unit 104 then executes the target document analysis process and outputs the analysis result to the monitoring control unit 10.

Step 108) The target document analysis result output from the target document analysis unit 104 is transferred to the target document analysis result storage unit 202, whose contents are updated so that the newly transferred values are stored and held.

[0034] Step 109) The standard document set analysis result stored in the standard document set analysis result storage unit 201 and the target document analysis result stored in the target document analysis result storage unit 202 are both transferred to the target document feature calculation unit 105. The target document feature calculation unit 105 executes the target document feature calculation process based on the transferred standard document set analysis result and target document analysis result. The processing result is output to the monitoring control unit 10.

[0035] Step 110) The target document features output from the target document feature calculation unit 105 are transferred to the target document feature storage unit 203, its contents are updated, and the newly transferred values are stored and held.
Step 111) The target document features stored in the target document feature storage unit 203 are transferred to the feature expression extraction unit 106. At this time, the feature expression extraction unit 106 executes the feature expression extraction process based on the target document features. The feature expression obtained as the processing result is output to the monitoring control unit 10.

[0036] Step 112) The feature expression output from the feature expression extraction unit 106 is transferred to the feature expression storage unit 204, its contents are updated, and the newly transferred values are stored and held.
Step 113) The feature expression stored in the feature expression storage unit 204 is transferred to the feature expression output unit 107. At this time, the feature expression output unit 107 executes the feature expression output process on the transferred feature expression. The processing result is output to the monitoring control unit 10.

[0037] Step 114) The feature expression output processing result output from the feature expression output unit 107 is output to the transfer medium 30.
Step 115) It is determined whether all processing has finished. If it has, the monitoring control process ends. If it has not, the process returns to step 101 and the above processing is repeated.

[0038] The standard document set updating unit 101 executes the standard document set update process on the standard document set transferred from the monitoring control unit 10. This is preprocessing for the subsequent steps: it removes from each document in the input standard document set the parts that the apparatus does not need, and converts the text to the character code supported by the subsequent processing. The processing result is output to the monitoring control unit 10.

[0039] The standard document set analysis unit 102 executes the standard document set analysis process on the standard document set update result transferred from the monitoring control unit 10. For each document, this process morphologically analyzes the text of the document and records the expression of each word together with the appearance frequency of that word in the transferred standard document set. The analysis result is output to the monitoring control unit 10 as a table with two columns, word expression and appearance frequency. Each row of the table holds one word expression and the appearance frequency of that word. The table is structured so that all rows corresponding to a word expression can be retrieved using that word expression as a key.
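The two-column table described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the apparatus's implementation: `tokenize` is a hypothetical stand-in that splits on word characters, whereas the embodiment uses morphological analysis, and the names `build_frequency_table` and `standard_set` are invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a crude stand-in for morphological analysis.
    # It lowercases the text and returns runs of word characters.
    return re.findall(r"\w+", text.lower())


def build_frequency_table(documents):
    """Return a table mapping each word expression to its appearance
    frequency across all documents in the set."""
    table = Counter()
    for doc in documents:
        table.update(tokenize(doc))
    return table


standard_set = [
    "The election was held in April.",
    "The weather in April was mild.",
]
table = build_frequency_table(standard_set)
# The word expression is the lookup key; unseen words report 0.
print(table["april"], table["summer"])
```

A hash-keyed mapping is one simple way to satisfy the requirement that rows be retrievable by word expression.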

[0040] The target document input unit 103 executes the target document input process on the target document transferred from the monitoring control unit 10. This is preprocessing for the subsequent steps: it removes from the transferred target document the parts that the apparatus does not need, and converts the text to the character code supported by the subsequent processing. The processing result is output to the monitoring control unit 10.

[0041] The target document analysis unit 104 executes the target document analysis process on the target document input processing result transferred from the monitoring control unit 10, and the resulting target document analysis result is output to the monitoring control unit 10. The process first morphologically analyzes the text of the document, associates the expression of each word with its appearance frequency in that document, and records them in a per-document table. One table is created for each target document. This table is output to the monitoring control unit 10 as the target document analysis result. The table consists of two columns, one recording the word expression and one recording the appearance frequency of the word, and is structured so that all rows corresponding to a word expression can be retrieved using that word expression as a key.

[0042] The target document feature calculation unit 105 calculates the target document features based on the standard document set analysis result and the target document analysis result transferred from the monitoring control unit 10. The "target document feature" is obtained by comparing the appearance frequency distribution of words in the standard document set with the appearance frequency distribution of words in the target document and expressing, for each word, the size of the difference as a number. The larger the difference between a word's appearance frequency distribution in the standard document set and in the target document, the larger the value. In other words, the more characteristic a word is of the target document, the larger its value.

[0043] In this embodiment, the target document feature is calculated for each word from the appearance frequency distribution of that word, using the approach of the chi-square test. The chi-square test can test "whether the distribution of a certain variable differs among several groups". In the present invention, this variable is a word in the document. The chi-square value is calculated from two distributions: the distribution of the expected appearance frequency of each word in the target document, which is computed from the total number of occurrences in the target document and the standard document set, the total number of occurrences of all words in the target document, and the total number of occurrences of all words in the standard document set; and the distribution of the actually observed appearance frequency of each word in the target document. The larger this value, the more the distributions differ, and the more unevenly such a word appears. In the present invention, this value is used to calculate the target document feature of each word.
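The text does not give an explicit formula, so the following is only one plausible reading of the chi-square idea, offered as a sketch. It assumes that the expected count of a word in each group is the word's combined count split in proportion to the group sizes, and that the score is the sum of (observed - expected)^2 / expected over the two groups. The function name `chi_square_scores` and the sample counts are hypothetical.

```python
from collections import Counter


def chi_square_scores(target_freq, standard_freq):
    """Score each word of the target document by a chi-square style
    statistic. Assumed formula, not taken from the source text."""
    n_t = sum(target_freq.values())    # all word occurrences, target
    n_s = sum(standard_freq.values())  # all word occurrences, standard set
    if n_t <= 0 or n_s <= 0:
        raise ValueError("both frequency tables must be non-empty")
    scores = {}
    for word, observed_t in target_freq.items():
        if observed_t <= 0:
            continue
        observed_s = standard_freq.get(word, 0)
        total = observed_t + observed_s
        # Expected counts if the word were spread evenly over both groups.
        expected_t = total * n_t / (n_t + n_s)
        expected_s = total * n_s / (n_t + n_s)
        scores[word] = ((observed_t - expected_t) ** 2 / expected_t
                        + (observed_s - expected_s) ** 2 / expected_s)
    return scores


target = Counter({"war": 5, "the": 10})
standard = Counter({"war": 5, "the": 1000, "of": 995})
scores = chi_square_scores(target, standard)
# "war" is far more concentrated in the target than "the" is.
print(scores["war"] > scores["the"])
```

Note that this plain statistic is also large for words that are unusually rare in the target document; a variant that keeps only over-represented words would be an equally reasonable design choice.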

[0044] The target document features are represented as a table with two columns: the expression of each word and a feature score that quantifies how characteristic the word is. Each row gives the correspondence between one word expression and its feature information score. The calculation result is output to the monitoring control unit 10. The feature expression extraction unit 106 uses the target document feature information transferred from the monitoring control unit 10 to extract from the target document an expression characteristic of that document (a feature expression), and outputs it to the monitoring control unit 10.

[0045] A feature expression in the target document is a sequence of a predetermined number of consecutive words contained in the document, a predetermined number of sentences, or the word sequence making up a predetermined partial structure of the document (for example, a paragraph). Among such candidates, it is the one for which the average of the feature information scores of the words it contains is largest.
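For the first case above, a fixed number of consecutive words, the selection can be sketched with a sliding window. This is an illustrative sketch only: the function name `best_window`, the sample words, and the choice to score unknown words as 0 are assumptions. Because every window has the same length, the window with the largest score sum is also the one with the largest average.

```python
def best_window(words, scores, size):
    """Return the run of `size` consecutive words whose average
    feature score is largest (first such run on ties)."""
    if size <= 0 or size > len(words):
        raise ValueError("size must be between 1 and len(words)")
    # Assumption: words without a score contribute 0.
    values = [scores.get(w, 0) for w in words]
    window_sum = sum(values[:size])
    best_sum, best_start = window_sum, 0
    for start in range(1, len(values) - size + 1):
        # Slide the window one word to the right.
        window_sum += values[start + size - 1] - values[start - 1]
        if window_sum > best_sum:
            best_sum, best_start = window_sum, start
    return words[best_start:best_start + size]


words = ["a", "b", "c", "d", "e"]
scores = {"a": 1, "b": 5, "c": 7, "d": 2, "e": 1}
print(best_window(words, scores, 2))
```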

[0046] When one sentence is extracted from the document as the feature expression, the extraction procedure is as follows. First, a feature information score is assigned to every word in the document. Next, for each sentence in the document, the average of the feature information scores assigned to the words making up that sentence is calculated.

[0047] Finally, the sentence with the largest average feature information score is extracted as the feature expression of the document. The feature expression output unit 107 outputs the feature expression for the target document, transferred from the monitoring control unit 10, to the transfer medium 30 through the monitoring control unit 10. FIG. 5 shows a standard document set according to an embodiment of the present invention, and FIG. 6 shows an example of part of a target document according to an embodiment of the present invention.
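The single-sentence extraction procedure (score every word, average per sentence, take the maximum) can be sketched as below. The sentence splitter and tokenizer are simplistic assumptions standing in for real sentence segmentation and morphological analysis, and the names `extract_feature_sentence`, `split_sentences`, and the sample scores are hypothetical.

```python
import re


def tokenize(text):
    # Assumption: stand-in for morphological analysis.
    return re.findall(r"\w+", text.lower())


def split_sentences(text):
    # Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_feature_sentence(text, scores):
    """Return the sentence whose words have the largest average
    feature score, or None if the text has no scorable sentence."""
    best_sentence, best_avg = None, float("-inf")
    for sentence in split_sentences(text):
        words = tokenize(sentence)
        if not words:
            continue
        # Words without a score are assumed to contribute 0.
        avg = sum(scores.get(w, 0.0) for w in words) / len(words)
        if avg > best_avg:
            best_sentence, best_avg = sentence, avg
    return best_sentence


text = "The war ended. It was sunny."
scores = {"war": 300.0, "ended": 120.0, "sunny": 5.0}
print(extract_feature_sentence(text, scores))
```

Averaging rather than summing keeps long sentences from winning merely because they contain more words.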

[0048] The operation of the apparatus is described below along the flowchart of FIG. 4, using an example of processing in which the standard document set partially shown in FIG. 5 and the target document shown in FIG. 6 are given. First, it is determined whether the standard document set has been updated (step 101). Here, it is assumed that it has been updated. The standard document set partially shown in FIG. 5 is transferred to the standard document set updating unit 101.

[0049] The standard document set is a set of documents, and the number of documents in it should preferably be sufficiently large. The standard document set updating unit 101 removes from each document in the standard document set the parts that are unnecessary for the subsequent processing. For example, HTML tags are removed from HTML documents, and character decoration and the like are removed from word processor documents. Furthermore, if the documents use different character codes, they are unified into a single code. The result is output to the monitoring control unit 10 (step 102).
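The preprocessing described above (dropping HTML tags and unifying character codes) might look like the following sketch, which uses only the Python standard library. It assumes the source encoding of each document is already known and passed in by the caller; the function name `normalize_document` and the sample input are hypothetical, and word processor formats are not handled.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def normalize_document(raw, encoding):
    """Decode raw bytes from a known encoding into one unified
    representation (a Python str) and strip HTML tags."""
    text = raw.decode(encoding)
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts).strip()


raw = "<p>選挙 <b>報道</b></p>".encode("shift_jis")
print(normalize_document(raw, "shift_jis"))
```

This simple extractor also keeps the contents of script and style elements, so a fuller version would need to skip them.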

[0050] The monitoring control unit 10 transfers this result to the standard document set analysis unit 102. The standard document set analysis unit 102 analyzes the transferred standard document set. That is, for each document, the text of the document is morphologically analyzed, and the word expressions in the standard document set and the appearance frequency of each word are obtained (step 103). FIG. 7 shows an example of a standard document set analysis result according to an embodiment of the present invention. The result is output to the monitoring control unit 10.

[0051] The monitoring control unit 10 transfers the standard document set analysis result to the standard document set analysis result storage unit 201. The standard document set analysis result storage unit 201 stores and holds the transferred standard document set analysis result (step 104). This completes the processing of the standard document set. Next, it is determined whether a target document has been input (step 105). If one has been input, processing proceeds as follows. The target document shown in FIG. 6 is input to the target document input unit 103. The target document input unit 103 removes from the input document the parts that are unnecessary for the subsequent processing and converts it to the character code supported by the subsequent processing. The processing result is output to the monitoring control unit 10 (step 106).

[0052] The monitoring control unit 10 transfers the target document output from the target document input unit 103 to the target document analysis unit 104. The target document analysis unit 104 analyzes the transferred target document. The target document analysis result is a table recording the word expressions that make up the target document and the appearance frequency of each word. Part of this result is shown in FIG. 8. The result is output to the monitoring control unit 10 (step 107).

[0053] The monitoring control unit 10 transfers the target document analysis result to the target document analysis result storage unit 202. The target document analysis result storage unit 202 stores and holds the transferred target document analysis result (step 108). Next, the target document features are calculated. First, the monitoring control unit 10 transfers the standard document set analysis result stored and held in the standard document set analysis result storage unit 201 and the target document analysis result stored and held in the target document analysis result storage unit 202 to the target document feature calculation unit 105.

[0054] The target document feature calculation unit 105 calculates the feature information scores by referring to the transferred standard document set analysis result and target document analysis result. The calculation result is output to the monitoring control unit 10 (step 109). Part of the result is shown in FIG. 9. The monitoring control unit 10 transfers the target document features to the target document feature storage unit 203. The target document feature storage unit 203 stores and holds the transferred target document features (step 110).

[0055] Next, a feature expression is extracted from the target document. First, the monitoring control unit 10 transfers the feature information stored and held in the target document feature storage unit 203 to the feature expression extraction unit 106. The feature expression extraction unit 106 extracts a feature expression from the target document using the feature information transferred from the monitoring control unit 10 and outputs the result to the monitoring control unit 10 (step 111). The extraction result is shown in FIG. 10. In the example in that figure, the words from 『２』 ("2") and 『戦』 ("war") through 『報道』 ("report") have values of 100 or more, and the sentence containing these feature expressions, 『４月１日に…帯びている』 (beginning "On April 1"), is extracted.

[0056] The monitoring control unit 10 transfers the output of the feature expression extraction unit 106 to the feature expression storage unit 204. The feature expression storage unit 204 stores and holds the transferred feature expression (step 112). The feature expression output unit 107 performs the feature expression output process, associating the feature expression stored in the feature expression storage unit 204 with the document from which it was extracted (step 113).

[0057] The monitoring control unit 10 associates the feature expression stored and held in the feature expression storage unit 204 with the document from which it was extracted and outputs it to the transfer medium 30 (step 114). As a result, what is output to the transfer medium 30 for the target document shown in FIG. 6 is the content shown in FIG. 10 (the second to fourth lines from the top of FIG. 6, 『４月１日に…帯びている。』).

[0058] The above embodiment uses various specific settings, but these are design choices and may be changed as necessary, for example as follows.
・The feature information is calculated using the approach of the chi-square test, but it may be calculated by another method.
・Words are used as the unit for calculating the feature score, but the unit may instead be characters or fixed-length character strings.

[0059] Although the above embodiment has been described based on the configuration of FIG. 3, the invention is not limited to this example. The monitoring control unit 10, standard document set updating unit 101, standard document set analysis unit 102, target document input unit 103, target document analysis unit 104, target document feature calculation unit 105, feature expression extraction unit 106, and feature expression output unit 107 shown in FIG. 3 may be built as a program and stored on a disk device connected to the computer used as the document summarizing apparatus, or on a portable storage medium such as a floppy disk or CD-ROM. Installing the program when the invention is to be practiced allows the invention to be realized easily.

[0060] The present invention is not limited to the above embodiment and can be modified and applied in various ways within the scope of the claims.

【００６１】[0061]

[Effects of the Invention] As described above, the present invention makes it possible to provide a document summarizing apparatus that extracts feature information and attaches it to a document as a summary without preparing feature information extraction knowledge in advance. This allows appropriate summarization even when the content of the target document differs from what was assumed when use began, or when the document contains new information.

[0062] Furthermore, since document information characterizing each document can be extracted as a summary from each document in a document set, applying the invention to the output editing device of a document retrieval system makes it possible to select and browse documents efficiently from the document set returned as search results.

[Brief description of the drawings]

FIG. 1 is a diagram for explaining the principle of the present invention.

FIG. 2 is a diagram of the principle configuration of the present invention.

FIG. 3 is a configuration diagram of the document summarizing apparatus of the present invention.

FIG. 4 is a flowchart of the monitoring control process executed by the monitoring control unit according to an embodiment of the present invention.

FIG. 5 is an example of a standard document set according to an embodiment of the present invention.

FIG. 6 is an example of part of a target document according to an embodiment of the present invention.

FIG. 7 is an example of a standard document set analysis result according to an embodiment of the present invention.

FIG. 8 is an example of a target document analysis result according to an embodiment of the present invention.

FIG. 9 is an example of part of the target document feature information according to an embodiment of the present invention.

FIG. 10 is an example of a feature expression according to an embodiment of the present invention.

[Explanation of symbols]

10 monitoring control unit
20 input storage device, input storage means
30 transfer medium
40 feature information calculation means
50 feature part extraction means
60 summary creation means
101 standard document set updating unit
102 standard document set analysis unit
103 target document input unit
104 target document analysis unit
105 target document feature calculation unit
106 feature expression extraction unit
107 feature expression output unit
201 standard document set analysis result storage unit
202 target document analysis result storage unit
203 target document feature storage unit
204 feature expression storage unit

Claims

[Claims]

1. A document summarizing method for extracting information characterizing a target document read from an input storage means storing document data and summarizing the document, the method comprising: comparing the target document to be summarized with a standard document set, which is the entire set of documents that can be processed, and calculating, from the target document, feature information characterizing the target document; extracting from the target document, based on the feature information, the portion that best characterizes the target document; and outputting the extracted portion as a summary characterizing the target document.

2. A document summarizing apparatus for extracting information characterizing a target document read from an input storage means storing document data and summarizing the document, comprising: feature information calculating means for comparing the target document to be summarized with a standard document set, which is the entire set of documents that can be processed, and calculating, from the target document, feature information characterizing the target document; feature part extracting means for extracting from the target document, based on the feature information, the part that best characterizes the target document; and summary creating means for outputting the part extracted by the feature part extracting means as a summary characterizing the target document.

3. The document summarizing apparatus according to claim 2, further comprising: standard document set updating means for acquiring a standard document set from the input storage means; standard document set analysis means for analyzing each document in the standard document set given to the standard document set updating means and calculating the words constituting the document and the appearance frequency of each word; standard document set analysis result storage means for storing the words in the standard document set in association with their appearance frequencies; target document input means for receiving the target document; target document analysis means for analyzing the target document and calculating the words constituting the target document and the appearance frequency of each word; target document analysis result storage means for storing the words in the target document in association with their appearance frequencies; target document feature calculation means for calculating the feature information of the target document using the standard document set analysis result stored in the standard document set analysis result storage means and the target document analysis result stored in the target document analysis result storage means; target document feature storage means for storing the feature information of the target document calculated by the target document feature calculation means; feature expression extraction means for extracting a feature expression from the target document using the feature information of the target document stored in the feature information storage means; feature expression storage means for storing the extracted feature expression; and feature expression output means for providing the feature expression stored in the feature expression storage means to a transfer medium.

4. The document summarizing apparatus according to claim 3, wherein the feature information calculating means includes means that uses a chi-square test to calculate the feature information.

5. The document summarizing apparatus according to claim 3, wherein the target document feature calculation means uses a character or a fixed-length character string as the unit for calculating the feature information score obtained by referring to the standard document set analysis result and the target document analysis result.

6. A storage medium storing a document summarizing program for extracting information characterizing a target document read from an input storage means storing document data and summarizing the document, the program comprising: a feature information calculating process of comparing the target document to be summarized with a standard document set, which is the entire set of documents that can be processed, and calculating, from the target document, feature information characterizing the target document; a feature part extraction process of extracting from the target document, based on the feature information, the part that best characterizes the target document; and a summary creation process of outputting the part extracted in the feature part extraction process as a summary characterizing the target document.

7. The storage medium storing the document summarizing program according to claim 6, further comprising: a standard document set updating process of acquiring a standard document set from the input storage means; a standard document set analysis process of analyzing each document in the standard document set given to the standard document set updating process, calculating the words constituting the document and the appearance frequency of each word, and storing the words in the standard document set in association with their appearance frequencies in standard document set analysis result storage means; a target document input process of receiving the target document; a target document analysis process of analyzing the target document, calculating the words constituting the target document and the appearance frequency of each word, and storing the words in the target document in association with their appearance frequencies in target document analysis result storage means; a target document feature calculation process of calculating the feature information of the target document using the standard document set analysis result stored in the standard document set analysis result storage means and the target document analysis result stored in the target document analysis result storage means, and storing the calculated feature information of the target document in target document feature storage means; a feature expression extraction process of extracting a feature expression from the target document using the feature information of the target document stored in the feature information storage means, and storing the feature expression in feature expression storage means; and a feature expression output process of providing the feature expression stored in the feature expression storage means to a transfer medium.