JPH1145278A

JPH1145278A - Document processor, storage medium storing document processing program and document processing method

Info

Publication number: JPH1145278A
Application number: JP9217125A
Authority: JP
Inventors: Naoyuki Nomura; 直之野村; Shinji Fujisawa; 信二藤澤
Original assignee: JustSystems Corp
Current assignee: JustSystems Corp
Priority date: 1997-07-27
Filing date: 1997-07-27
Publication date: 1999-02-16
Anticipated expiration: 2017-07-27
Also published as: JP4025391B2

Abstract

PROBLEM TO BE SOLVED: To determine whether a single document contains a plurality of topics, so that a summary can be prepared from which the content of each topic can be accurately understood. SOLUTION: A document A is divided into a plurality of sub-documents (S14), a document vector (b) is obtained for each sub-document (S15), and the similarity (s) between adjacent sub-documents is obtained from the cosine value between the adjacent document vectors (b) (S16). Breaks Xn predicted to be topic change points are provisionally determined from the values of the similarity (s) (S17), a document vector B and a similarity S are obtained for the sub-document groups delimited by the breaks (S19 and S20), and the topic change points are finally determined from the similarity S (S21). Sub-summaries are then prepared for the sub-document groups delimited by the finally determined topic change points (S22), and the sub-summaries are combined to obtain the final summary of the summarization target document A.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a document processing apparatus, a storage medium storing a document processing program, and a document processing method, and more particularly to determining whether a created document contains a plurality of topics.

[0002]

[Prior art] Conventionally, computers have been used to perform various kinds of processing on documents such as books, papers, and reports, including automatic creation of summaries (including abstracts) and association with other documents. Automatic document summarization is proposed, for example, in "Extraction and processing of semantic information from full-text information" (Proceedings of the 38th National Convention of the Information Processing Society of Japan, p. 222; 1989). In this method, important words in a document are first extracted using information such as character type and verbs, and the most important words are then determined from the frequency of occurrence of the important words. Next, important sentences are determined according to whether the important words and the most important words appear in them, which makes it possible to create a summary automatically. A method described in Japanese Patent Application Laid-Open No. H3-191475, which creates a summary more accurately by reflecting the nature of the paragraphs of the text, has also been proposed. As for association with other data, hyperlinks on the Internet and associations in knowledge processing (expert systems and the like) based on frame systems and the like are in use.

[0003]

[Problem to be solved by the invention] However, conventional document processing could not determine whether the document being processed contains a plurality of topics, and processed the document as a whole. For this reason, the conventional automatic summarization methods described above can create a relatively appropriate summary for a document containing a single topic, but could not always create an appropriate summary for a document containing a plurality of topics. That is, the summary was created by selecting important sentences while ignoring the existence of multiple units containing mutually different assertions and statements of fact, and then joining those sentences together, so the resulting summary had low readability. In addition, even a document containing a plurality of topics could be associated with other data only as a whole document.

[0004] The present invention has been made to solve these conventional problems. Its first object is to provide a document processing apparatus capable of determining whether one document contains a plurality of topics. Its second object is to provide a storage medium storing a computer-readable document processing program capable of determining whether one document contains a plurality of topics. Its third object is to provide a document processing method capable of determining whether one document contains a plurality of topics.

[0005]

[Means for solving the problem] In the invention according to claim 1, as shown in FIG. 11, the first object is achieved by providing a document processing apparatus with: document acquisition means 101 for acquiring a document in a predetermined format composed of a plurality of sentences; document dividing means 102 for dividing the document acquired by the document acquisition means 101 into a plurality of sub-documents; similarity calculation means 103 for calculating the similarity between the sub-documents divided by the document dividing means 102; and determination means 104 for determining, from the similarity between sub-documents calculated by the similarity calculation means 103, whether the document contains a plurality of topics.

In the invention according to claim 2, as shown in FIG. 12, the document processing apparatus according to claim 1 further comprises document vector determination means 105 for determining a document vector that characterizes each sub-document divided by the document dividing means 102, and the similarity calculation means 103 calculates the similarity between sub-documents from the document vectors of the sub-documents determined by the document vector determination means 105.

In the invention according to claim 3, as shown in FIGS. 11 and 12, in the document processing apparatus according to claim 1 or claim 2, the determination means 104 provisionally determines topic change points from the similarity between sub-documents calculated by the similarity calculation means 103; the similarity calculation means 103 further calculates the similarity between the sub-document groups obtained by re-dividing at the topic change points provisionally determined by the determination means 104; and the determination means 104 determines, from the similarity between sub-document groups calculated by the similarity calculation means 103, whether the document contains a plurality of topics.

In the invention according to claim 4, as in the example shown in FIG. 13, the document processing apparatus according to claim 1, claim 2, or claim 3 has summary creation means 106 for automatically creating a summary of a document composed of a plurality of documents, and when the determination means 104 determines that the document contains a plurality of topics, the summary creation means 106 creates a summary for each unit constituting a topic.

In the invention according to claim 5, as in the example shown in FIG. 14, the document processing apparatus according to any one of claims 1 to 4 has association means 107 for associating predetermined data with other data, and the association means 107 performs the association with other data for each unit constituting a topic determined by the determination means 104.

In the invention according to claim 6, as in the example shown in FIGS. 11 to 14, in the document processing apparatus according to any one of claims 1 to 5, when the determination means determines that a plurality of topics is not included, the dividing means re-divides the document into sub-documents of a different size, the similarity calculation means recalculates the similarity between the re-divided sub-documents, and the determination means re-determines, from the recalculated similarity, whether the document contains a plurality of topics.

In the invention according to claim 7, as shown in FIG. 15, the second object is achieved by storing on a storage medium a computer-readable document processing program for causing a computer to implement: a document acquisition function 201 for acquiring a document in a predetermined format composed of a plurality of sentences; a document dividing function 202 for dividing the document acquired by the document acquisition function 201 into a plurality of sub-documents; a similarity calculation function 203 for calculating the similarity between the sub-documents divided by the document dividing function 202; and a determination function 204 for determining, from the similarity between sub-documents calculated by the similarity calculation function 203, whether the document contains a plurality of topics.

In the invention according to claim 8, as shown in FIG. 16, the storage medium according to claim 7 further comprises a document vector determination function 205 for determining a document vector that characterizes each sub-document divided by the document dividing function 202, and the similarity calculation function 203 calculates the similarity between two adjacent sub-documents from the document vectors of the sub-documents determined by the document vector determination function 205.

In the invention according to claim 9, as shown in FIGS. 15 and 16, in the storage medium according to claim 7 or claim 8, the determination function 204 provisionally determines topic change points from the similarity between sub-documents calculated by the similarity calculation function 203; the similarity calculation function 203 further calculates the similarity between the sub-document groups obtained by re-dividing at the topic change points provisionally determined by the determination function 204; and the determination function 204 determines, from the similarity between sub-document groups calculated by the similarity calculation function 203, whether the document contains a plurality of topics.

In the invention according to claim 10, as in the example shown in FIG. 17, the storage medium according to claim 7, claim 8, or claim 9 has a summary creation function 206 for automatically creating a summary of a document composed of a plurality of documents, and when the determination function 204 determines that the document contains a plurality of topics, the summary creation function 206 creates a summary for each unit constituting a topic.

In the invention according to claim 11, as in the example shown in FIG. 18, the storage medium according to any one of claims 7 to 10 has an association function 207 for associating predetermined data with other data, and the association function 207 performs the association with other data for each unit constituting a topic determined by the determination function 204.

In the invention according to claim 12, as in the example shown in FIGS. 15 to 18, in the storage medium according to any one of claims 7 to 11, when the determination function 204 determines that a plurality of topics is not included, the dividing function 202 re-divides the document into sub-documents of a different size, the similarity calculation function 203 recalculates the similarity between the re-divided sub-documents, and the determination function 204 re-determines, from the recalculated similarity, whether the document contains a plurality of topics.

In the invention according to claim 13, as shown in FIG. 19, the third object is achieved by acquiring (301) a document in a predetermined format composed of a plurality of sentences, dividing (302) the acquired document into a plurality of sub-documents, calculating (303) the similarity between the divided sub-documents, and determining (304), from the calculated similarity between sub-documents, whether the document contains a plurality of topics.

In the invention according to claim 14, as shown in FIG. 20, in the document processing method according to claim 13, the similarity between adjacent sub-documents is obtained by determining (303a) document vectors that characterize the divided sub-documents and calculating (303b) the similarity from the determined document vectors of the sub-documents.

In the invention according to claim 15, as in the example shown in FIG. 21, in the invention according to claim 13 or claim 14, when the document is determined to contain a plurality of topics, a summary is created (305) for each unit constituting a topic.
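The re-division behaviour of claims 6 and 12 (retry with a different sub-document size when no topic change is found) can be sketched roughly as follows. This is only an illustrative Python reading under stated assumptions: sentences are split on whitespace, raw word counts stand in for the document vectors, a document is judged multi-topic when any adjacent pair of sub-documents has cosine similarity at or below `threshold`, and the names `judge_with_redivision`, `sizes`, and `threshold` are hypothetical rather than taken from the patent.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(w * v.get(k, 0) for k, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def split(sentences, size):
    """Group consecutive sentences into sub-documents of `size` sentences."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def has_multiple_topics(sentences, size, threshold):
    """Assumed decision rule: some adjacent pair of sub-documents is dissimilar."""
    subdocs = split(sentences, size)
    vectors = [Counter(word for s in sd for word in s.split()) for sd in subdocs]
    return any(cosine(a, b) <= threshold for a, b in zip(vectors, vectors[1:]))


def judge_with_redivision(sentences, sizes=(4, 2, 1), threshold=0.2):
    """Try each sub-document size in turn; stop at the first that finds a topic change."""
    for size in sizes:
        if has_multiple_topics(sentences, size, threshold):
            return True, size
    return False, None
```

With the six sentences "cats purr", "cats sleep", "cats eat", "taxes rise", "taxes fall", "taxes bite", the sizes 4 and 2 blur the boundary and only size 1 exposes it, which is the situation the re-division step is meant to handle.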

[0006]

[Embodiments of the invention] Preferred embodiments of the document processing apparatus, the storage medium storing a document processing program, and the document processing method of the present invention are described in detail below with reference to FIGS. 1 to 10.

(1) Overview of the embodiment
In this embodiment, a document is divided into a plurality of sub-documents (units), a document vector is obtained for each sub-document, and the difference between document vectors is taken between sub-documents or between sets of sub-documents. The document is re-divided at places where the cosine value between two consecutive sub-documents is markedly low. The overall similarity between the m units before and the n units after each re-division position is also evaluated, and when it is at or below a predetermined threshold T2, the position is finally determined to be a topic change point. Then, by applying conventional summarization processing to each single-topic region, the summary of the whole document is generated not as a single summary but as a collection of multiple summaries. When the per-topic summaries are combined, the compound noun phrase identified as each topic may be shown explicitly as the title of the corresponding subsection in the summary.
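The two-stage judgment in the overview can be sketched in Python as below. This is a minimal sketch, not the patented implementation. It assumes that each unit is already a list of words, that raw term counts stand in for the keyword importance used in the document vectors, and that the first-stage cut-off is a fixed value `t1` (the overview only says "markedly low" and names only T2). The function names and the default values of `t1`, `t2`, `m`, and `n` are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine value between two sparse vectors stored as dict-like counters."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def vectorize(units):
    """Merge one or more units (lists of words) into a single term-count vector."""
    vec = Counter()
    for unit in units:
        vec.update(unit)
    return vec


def topic_breaks(units, t1=0.2, t2=0.3, m=2, n=2):
    """Return the indices x where a topic change is finally judged to occur
    (the change lies between units[x - 1] and units[x])."""
    vectors = [vectorize([u]) for u in units]

    # Stage 1: provisional breaks where adjacent units have a low cosine value.
    candidates = [i + 1 for i in range(len(units) - 1)
                  if cosine(vectors[i], vectors[i + 1]) <= t1]

    # Stage 2: confirm each break by comparing the m units before it
    # with the n units after it against the threshold T2.
    confirmed = []
    for x in candidates:
        before = vectorize(units[max(0, x - m):x])
        after = vectorize(units[x:x + n])
        if cosine(before, after) <= t2:
            confirmed.append(x)
    return confirmed
```

The second stage is what filters out a single off-topic unit: an isolated dissimilar unit produces provisional breaks on both sides, but the wider m-before and n-after comparison stays similar, so neither break is confirmed.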

[0007] (2) Details of the embodiment
FIG. 1 is a block diagram showing the configuration of the document processing apparatus. The document processing apparatus of this embodiment is configured as a computer system such as a personal computer or a word processor, and can also be configured as a LAN (local area network) server or as a host for computer (PC) communications including the Internet. As shown in FIG. 1, the document processing apparatus has a control unit 11 for controlling the entire apparatus. Connected to the control unit 11 via a bus line 21 such as a data bus are a keyboard 12 and a mouse 13 as input devices, a display device 14, a printing device 15, a storage device 16, a storage medium drive device 17, a communication control device 18, an input/output I/F 19, and a character recognition device 20. The control unit 11 includes a CPU 111, a ROM 112, and a RAM 113. The ROM 112 is a read-only memory in which the various programs and data used by the CPU 111 for control and computation are stored in advance.

[0008] The RAM 113 is a random access memory used by the CPU 111 as working memory. In the RAM 113, a summarization target document storage area 1131, a summary parameter storage area 1132, a break position storage area 1133, a document vector storage area 1134, a summary storage area 1135, and various other areas are reserved as areas for the summarization processing of this embodiment. The document vector storage area 1134 stores the document vector for the summarization target document and the document vector for each sub-document, described later. The summary storage area 1135 stores a sub-summary for each sub-document group containing a topic found by this embodiment, and the summary for the entire summarization target document.

[0009] The keyboard 12 has various keys, including kana keys for entering kana characters, a numeric keypad, function keys for executing various functions, and cursor keys. The mouse 13 is a pointing device, an input device with which a function is designated by left-clicking the corresponding key, icon, or the like displayed on the display device 14. A CRT, a liquid crystal display, or the like is used as the display device 14. The display device displays the content of the summarization target document, the content of the summary automatically generated by this embodiment, and so on. The printing device 15 prints text displayed on the display device 14, documents stored in the document storage unit 164 of the storage device 16, and the like. Various printing devices can be used, such as laser printers, dot printers, inkjet printers, page printers, thermal printers, and thermal transfer printers.

[0010] The storage device 16 consists of a readable and writable storage medium and a drive device for reading and writing programs, data, and other information on that medium. A hard disk is mainly used as the storage medium of the storage device 16, but any readable and writable medium among the various storage media used with the storage medium drive device 17, described later, may be used instead. The storage device 16 has a kana-kanji conversion dictionary 161, a program storage unit 162, a data storage unit 163, a document database 164, a summary database 165, a document vector database 166, and other storage units not shown (for example, a storage unit for backing up the programs and data stored in the storage device 16). The program storage unit 162 stores the programs of this embodiment, such as the automatic summarization program, the document vector creation program, and the summary creation program, as well as other programs such as a kana-kanji conversion program that converts an input kana character string into mixed kanji-kana text using the kana-kanji conversion dictionary 161. The data storage unit 163 stores various data such as the default values of the summary parameters. The stored defaults are, for example: ratio of the summary to the whole document = "25%"; emphasis on quantities such as date and time, price information, and physical quantities (size, weight, temperature, etc.) = "no"; emphasis on URLs (Uniform Resource Locators) = "no"; emphasis on long simple sentences = "no"; and selection of the desu/masu/dearu sentence-ending style = "no".
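The default summary parameters listed above could be held in a small immutable record, as in the following Python sketch. The class name, field names, and the `with_overrides` helper are hypothetical; only the default values (a 25% summary ratio and "no" for each emphasis or selection option) come from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SummaryParameters:
    """Illustrative container for the summary parameter defaults."""
    summary_ratio: float = 0.25                  # summary size relative to the whole document
    emphasize_quantities: bool = False           # dates, times, prices, physical quantities
    emphasize_urls: bool = False                 # URLs (Uniform Resource Locators)
    emphasize_long_simple_sentences: bool = False
    select_sentence_ending_style: bool = False   # desu / masu / dearu selection


def with_overrides(defaults, **changes):
    """Return a copy of the defaults with user-specified values replaced."""
    return replace(defaults, **changes)
```

Keeping the defaults frozen and producing a modified copy per request mirrors the separation in the text between defaults held in the data storage unit 163 and working values held in the summary parameter storage area of the RAM.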

[0011] The document database 164 stores documents created with the kana-kanji conversion program and documents created on other devices and read in through the storage medium drive device 17 or the communication control device 18. The format of the documents stored in the document database 164 is not particularly limited; documents in various formats can be stored, such as text documents, HTML (Hyper Text Markup Language) documents, and JIS-format documents. In addition to document data in these formats, the document database 164 also stores data such as the break positions X, which are the topic change points found by this embodiment. The summary database 165 and the document vector database 166 store the summaries and document vectors corresponding to the documents stored in the document database 164.

[0012] FIG. 2 conceptually shows the contents of the document vector database 166. As shown in FIG. 2, the importance f(x) obtained for each keyword x automatically extracted from a document is stored as an element value f(x) of the document vector. A document vector is stored for each document (A, B, C, ...) and is associated with the corresponding document stored in the document database 164. The dimension of each document vector is the number of keywords x (important words and phrases) adopted; however, when the similarity between two documents is obtained from their document vectors, the dimension of both vectors is the size of the union of the two documents' keywords. In this case, for a keyword contained in only one of the document vectors, the corresponding element of the other document vector is defined as "0".
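The union-and-zero-fill rule for comparing two document vectors can be sketched in Python as follows. The sketch assumes each document vector is a mapping from keyword to importance f(x) and uses the cosine value as the similarity, as stated in the abstract. The function names and the keywords and values in the usage note are hypothetical and are not taken from FIG. 2.

```python
import math


def align(vec_a, vec_b):
    """Expand two keyword->importance mappings onto the union of their keywords.

    A keyword present in only one document gets the element value 0 in the other.
    """
    keywords = sorted(set(vec_a) | set(vec_b))
    a = [vec_a.get(k, 0) for k in keywords]
    b = [vec_b.get(k, 0) for k in keywords]
    return keywords, a, b


def cosine_similarity(a, b):
    """Cosine value between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

For instance, aligning `{"alpha": 3, "beta": 4}` with `{"alpha": 3, "gamma": 4}` yields the three-dimensional vectors `[3, 4, 0]` and `[3, 0, 4]` over the keywords `alpha, beta, gamma`, whose cosine value is 9 / 25 = 0.36.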

[0013] For example, in FIG. 2, the keywords of document B are "important, important word, importance, ..." and the keywords of document C are "important, ..., politics, ...", and the document vectors of the two documents are as follows.
Document vector of document B = (1, 18, 19, ...)
Document vector of document C = (18, ..., 21, ...)
When the similarity between document B and document C is calculated, on the other hand, the keywords of both documents are taken to be "important, important word, importance, ..., politics, ...", and the document vectors of the two documents are defined as follows.
Document vector of document B = (1, 18, 19, ..., 0, ...)
Document vector of document C = (18, 0, 0, ..., 21, ...)

[0014] The storage medium drive device 17 is a drive device with which the CPU 111 reads computer programs and data, including documents, from external storage media. The computer programs and the like stored on a storage medium include the programs for the various processes executed by the document processing apparatus of this embodiment, as well as the dictionaries and data they use. Here, a storage medium means a medium on which computer programs, data, and the like are stored. Specifically, it includes magnetic storage media such as floppy disks, hard disks, and magnetic tape; semiconductor storage media such as memory chips and IC cards; optically read storage media such as CD-ROMs, MOs, and PDs (phase-change rewritable optical disks); storage media using paper, such as paper cards and paper tape (and media with functions equivalent to paper); and storage media on which computer programs and the like are stored by various other methods. The storage media used with the document processing apparatus of this embodiment are mainly CD-ROMs, floppy disks, and the like. Besides reading computer programs from these storage media, the storage medium drive device 17 can write data stored in the RAM 113 or the storage device 16 to a writable storage medium such as a floppy disk.

[0015] In the document processing apparatus of this embodiment, the CPU 111 of the control unit 11 reads a computer program from an external storage medium set in the storage medium drive device 17 and stores (installs) it in the appropriate part of the storage device 16. When a process such as the automatic summarization process of this embodiment is executed, the corresponding program is read from the storage device 16 into the RAM 113 and executed. However, a program can also be read directly into the RAM 113 from an external storage medium by the storage medium drive device 17, rather than from the storage device 16, and executed. Depending on the document processing apparatus, the automatic summarization program and the like of this embodiment may be stored in the ROM 112 in advance and executed by the CPU 111. Furthermore, the various programs and data, such as the automatic summarization program of this embodiment, may be downloaded from another storage medium via the communication control device 18 and executed.

[0016] The communication control device 18 can send and receive various data, such as documents in text, HTML, and other formats and bitmap data, to and from other personal computers, word processors, and the like. The input/output I/F 19 is an interface for connecting various devices, such as speakers that output voice and music. The character recognition device 20 recognizes characters written on paper or the like in various formats such as text or HTML, and consists of an image scanner, a character recognition program, and the like.

【００１７】本実施形態では、キーボード１２の入力操
作により作成した文書（ＲＡＭ１１３の所定格納エリア
に格納）の他、外部で作成して所定の記憶媒体に格納し
た文書で記憶媒体駆動装置１７から読み込んだ文書、予
め文書データベースに格納されている文書、通信制御装
置１８からダウンロードした文書、及び文字認識装置２
０で文字認識した文書、等の各種文書を対象文書として
取得する（文書取得手段）ことが可能である。In this embodiment, various documents can be acquired as target documents (document acquisition means): a document created by input operations on the keyboard 12 (stored in a predetermined storage area of the RAM 113); a document created externally, stored on a predetermined storage medium, and read via the storage medium drive 17; a document stored in advance in the document database; a document downloaded through the communication control device 18; and a document recognized by the character recognition device 20.

【００１８】以上のように構成された本実施形態の文書
処理装置による、トピック数に応じた要約を作成する自
動要約処理の動作について図３から図１０を用いて説明
する。図３は自動要約処理のメイン動作を表したもので
あり、図４〜図８は自動要約処理の各工程における処理
を概念的に表したものである。この図３のフローチャー
トの右側に記した（Ａ）〜（Ｉ）は図４から図８の
（Ａ）〜（Ｉ）に対応したものである。図４（Ａ）〜図
８（Ｉ）中に示した文書ベクトルは、概念的に理解しや
すくするために２次元で表示したものであるが、実際に
はＮ次元ベクトルである。ＣＰＵ１１１は、要約を作成
する対象となっている要約対象文書Ａ（図４（Ａ））を
取得し、ＲＡＭ１１３の要約対象文書格納エリア１１３
１に格納する（ステップ１１）。要約対象文書Ａは、ユ
ーザの指示に従ってＲＡＭ１１３（自装置内で作成され
た文書である場合）、記憶装置１６の文書データベース
１６４（要約が未だ作成されていない文書である場
合）、記憶媒体駆動装置１７（自装置または他装置で作
成済みの文書の場合）、通信制御装置１８（パソコン通
信、インターネット等の通信による場合）から取得す
る。The operation of the automatic summarization process, in which the document processing apparatus of the present embodiment configured as described above creates a summary according to the number of topics, will be described with reference to FIGS. 3 to 10. FIG. 3 shows the main operation of the automatic summarization process, and FIGS. 4 to 8 conceptually illustrate the processing at each step. The labels (A) to (I) on the right side of the flowchart in FIG. 3 correspond to (A) to (I) in FIGS. 4 to 8. The document vectors shown in FIG. 4(A) to FIG. 8(I) are drawn in two dimensions for ease of understanding, but are actually N-dimensional vectors. The CPU 111 acquires the summary target document A (FIG. 4(A)) and stores it in the summary target document storage area 1131 of the RAM 113 (step 11). In accordance with the user's instruction, the summary target document A is acquired from the RAM 113 (for a document created on the apparatus itself), from the document database 164 of the storage device 16 (for a document whose summary has not yet been created), from the storage medium drive 17 (for a document already created on this or another apparatus), or from the communication control device 18 (for a document obtained via PC communication, the Internet, or the like).

【００１９】次にＣＰＵ１１１は、ユーザによってキー
ボード１２等から要約パラメータが入力された場合には
入力値を取得し、ユーザによる入力がない場合にはデー
タ格納部１６３に格納された要約パラメータのデフォル
ト値を取得し、要約パラメータ格納エリア１１３２に格
納する（ステップ１２）。Next, if the user has entered summary parameters from the keyboard 12 or the like, the CPU 111 acquires the entered values; if there is no user input, it acquires the default summary parameter values stored in the data storage unit 163. It stores the values in the summary parameter storage area 1132 (step 12).

【００２０】次にＣＰＵ１１１は、要約対象文書格納エ
リア１１３１に格納した要約対象文書Ａに対する文書ベ
クトルＶ（図４（Ｂ））を求める（ステップ１３）。図
９は、文書ベクトル作成処理の動作を表したフローチャ
ートである。ＣＰＵ１１１は、形態素解析を行うことで
要約対象文書Ａから自立語を抽出する（ステップ１３
１）と共に、名詞句、複合名詞句等を含めた候補語
（句）を要約対象文書Ａから抽出しＲＡＭ１１３の所定
作業領域に格納する（ステップ１３２）。そして抽出し
た候補語（句）の要約対象文書Ａでの出現頻度、評価関
数から、各候補語（句）の重要度ｆ（ｘ）を決定する
（ステップ１３３）。ここで、評価関数としては、例え
ば、所定の重要語が予め指定されている場合にはその重
要語に対する重み付け、単語、名詞句、複合名詞句等の
候補語（句）の種類による重み付け等が使用される。さ
らにＣＰＵ１１１は、決定した重要度ｆ（ｘ）の値から
要約対象文書Ａのキーワードａ，ｂ，…を決定する（ス
テップ１３４）。そして、各キーワードの重要度ｆ
（ｘ）を要素として、文書ベクトルＶ＝（ｆ（ａ），ｆ
（ｂ），…）をＲＡＭ１１３の文書ベクトル格納エリア
１１３４に格納して（ステップ１３５）、図３の自動要
約処理ルーチンにリターンする。Next, the CPU 111 obtains a document vector V (FIG. 4(B)) for the summary target document A stored in the summary target document storage area 1131 (step 13). FIG. 9 is a flowchart of the document vector creation process. The CPU 111 performs morphological analysis to extract independent words (content words) from the summary target document A (step 131), extracts candidate words (phrases), including noun phrases and compound noun phrases, from the summary target document A, and stores them in a predetermined work area of the RAM 113 (step 132). It then determines the importance f(x) of each candidate word (phrase) from its frequency of occurrence in the summary target document A and from an evaluation function (step 133). The evaluation function applies, for example, a weight for predetermined important words when such words have been designated in advance, and a weight according to the type of candidate word (phrase), such as word, noun phrase, or compound noun phrase. The CPU 111 then determines the keywords a, b, ... of the summary target document A from the importance values f(x) (step 134). Finally, it stores the document vector V = (f(a), f(b), ...), whose elements are the keyword importances f(x), in the document vector storage area 1134 of the RAM 113 (step 135) and returns to the automatic summarization routine of FIG. 3.
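The document vector creation of steps 131 to 135 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the patent: a lowercase regular-expression tokenizer stands in for morphological analysis, the evaluation function is reduced to a single multiplicative weight for pre-designated important words, and the names `importance_scores`, `top_keywords`, and `document_vector` are hypothetical.

```python
import re
from collections import Counter


def importance_scores(text, important_words=(), important_weight=2.0):
    """Return {term: f(x)}, where f(x) = frequency x evaluation weight."""
    # Assumption: a simple tokenizer replaces morphological analysis.
    tokens = re.findall(r"[a-z]+", text.lower())
    freq = Counter(tokens)
    return {
        term: count * (important_weight if term in important_words else 1.0)
        for term, count in freq.items()
    }


def top_keywords(text, k=5, **kwargs):
    """Pick the k terms with the highest importance (ties broken alphabetically)."""
    scores = importance_scores(text, **kwargs)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def document_vector(text, keywords, **kwargs):
    """Build V = (f(a), f(b), ...) over a fixed keyword order."""
    scores = importance_scores(text, **kwargs)
    return [scores.get(term, 0.0) for term in keywords]


if __name__ == "__main__":
    text = "cat dog cat bird"
    print(top_keywords(text, k=2))
    print(document_vector(text, ["cat", "dog", "fish"]))
```

Fixing the keyword order up front matters for the later steps: vectors of different sub-documents can only be compared if element i refers to the same keyword in each of them.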

【００２１】文書ベクトルＶが求まるとＣＰＵ１１１
は、図５（Ｃ）に示すように要約対象文書Ａを所定数の
Ｐ個のサブ文書Ａ１，Ａ２，…，ＡＰに分割する。サブ
文書の分割方法は任意であり、具体的には、ｒ文字数毎
に分割、ｓ行数毎に分割、ｔページ毎に分割、ｕセンテ
ンス毎に分割、全文字数を１／Ｒに分割、全行数を１／
Ｓに分割、全ページを１／Ｔに分割、全センテンスを１
／Ｕに分割、等の方法がある。また、サブ文書サイズを
一定サイズで分割せず、文書中の一部（例えば、文書
頭、文書中央、文書末等）を他の部分よりも大きなサイ
ズのサブ文書とすることも可能である。これらの分割方
法は、いずれか１の方法が予め規定され、または、ユー
ザにより要約パラメータの１つとして選択可能にしても
よい。ＣＰＵ１１１は、分割による切れ目がセンテンス
の途中になる場合には、そのセンテンス全体が前のサブ
文書に含まれる位置をサブ文書の区切れ位置Ｘとして各
サブ文書の区切れ位置Ｘｎ（ｎ＝１〜（Ｐ−１））を求
め、区切れ位置格納エリア１１３３に格納する（ステッ
プ１４）。Once the document vector V has been obtained, the CPU 111 divides the summary target document A into a predetermined number P of sub-documents A1, A2, ..., AP, as shown in FIG. 5(C). Any division method may be used: for example, dividing every r characters, every s lines, every t pages, or every u sentences, or dividing the total number of characters into 1/R, the total number of lines into 1/S, the total number of pages into 1/T, or the total number of sentences into 1/U. The sub-documents need not be of uniform size; part of the document (for example, the beginning, middle, or end) may form a sub-document larger than the others. One of these division methods may be specified in advance, or the user may be allowed to select one as a summary parameter. When a division point falls in the middle of a sentence, the CPU 111 takes as the break position X the position at which the whole sentence is included in the preceding sub-document. It obtains the break positions Xn (n = 1 to P-1) of the sub-documents in this way and stores them in the break position storage area 1133 (step 14).
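One of the division methods listed above, dividing every r characters while moving a cut that lands mid-sentence to the end of that sentence, might be sketched as follows. The function name `split_into_subdocuments` and the set of sentence terminators are assumptions made for illustration only.

```python
import re


def split_into_subdocuments(text, r):
    """Split text roughly every r characters.

    A cut that would fall inside a sentence is moved to the end of that
    sentence, so the whole sentence stays in the preceding sub-document.
    Returns (sub_documents, break_positions).
    """
    # Offsets just past each sentence terminator (assumed terminator set).
    ends = [m.end() for m in re.finditer(r"[.!?。]", text)]
    if not ends or ends[-1] != len(text):
        ends.append(len(text))

    breaks, start = [], 0
    for end in ends:
        # First sentence end at or beyond the nominal cut start + r.
        if end - start >= r and end < len(text):
            breaks.append(end)
            start = end

    bounds = [0] + breaks + [len(text)]
    subs = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    return subs, breaks


if __name__ == "__main__":
    subs, breaks = split_into_subdocuments("Aa. Bb. Cc. Dd.", 5)
    print(subs, breaks)
```

The sub-documents concatenate back to the original text, so the break positions can be stored as plain character offsets.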

【００２２】次にＣＰＵ１１１は、図９に従って説明し
た文書ベクトル作成処理により、区切れ位置Ｘで区切ら
れた各サブ文書Ａ１〜ＡＰをそれぞれ１つの文書とみな
して文書ベクトルｂ１〜ｂＰ（図５（Ｄ））を求める
（ステップ１５）。そして、図６（Ｅ）に示すように、
互いに隣接するサブ文書ＡｎとＡｎ＋１（ｎ＝１〜Ｐ−
１）との間の類似度ｓｎｎ＋１を、両者の文書ベクトル
ｂｎと文書ベクトルｂｎ＋１間の角度に依存するコサイ
ンにより求める（ステップ１６）。すなわち、両文書ベ
クトルｂｎとｂｎ＋１間の角度をｑとし、両文書ベクト
ルの内積をｂｎ・ｂｎ＋１とし、両文書ベクトルの大き
さをそれぞれ｜ｂｎ｜、｜ｂｎ＋１｜とした場合、両文
書ベクトルの類似度ｓｎｎ＋１は次の数式１により求ま
る。Next, using the document vector creation process described with reference to FIG. 9, the CPU 111 treats each of the sub-documents A1 to AP delimited by the break positions X as a single document and obtains document vectors b1 to bP (FIG. 5(D)) (step 15). Then, as shown in FIG. 6(E), it obtains the similarity s_{n,n+1} between each pair of adjacent sub-documents An and An+1 (n = 1 to P-1) as the cosine of the angle between their document vectors bn and bn+1 (step 16). That is, with q the angle between the document vectors bn and bn+1, bn · bn+1 their inner product, and |bn| and |bn+1| their magnitudes, the similarity s_{n,n+1} is given by Equation 1 below.

【００２３】[0023]

【数１】類似度ｓｎｎ＋１＝ＣＯＳ（ｑ）＝（ｂｎ・ｂｎ＋１）／（｜ｂｎ｜×｜ｂｎ＋１｜）[Equation 1] $s_{n,n+1} = \cos q = \dfrac{b_n \cdot b_{n+1}}{|b_n|\,|b_{n+1}|}$

【００２４】この類似度ｓの値は−１≦ｓ≦１までの値
をとり、１に近いほど２つの文書ベクトルが互いに平行
に近く、２つのサブ文書同士は似ていると考えることが
できる。The similarity s takes values in the range -1 ≤ s ≤ 1; the closer it is to 1, the more nearly parallel the two document vectors are, and the more similar the two sub-documents can be considered to be.
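As a minimal sketch of Equation 1, the cosine similarity between adjacent document vectors can be computed as below. The function names are hypothetical, and returning 0.0 for a zero-length vector is an assumption, since the text does not say how an empty sub-document should be handled.

```python
import math


def cosine_similarity(u, v):
    """Equation 1: (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Assumption: treat a zero vector as having no similarity to anything.
    return dot / norm if norm else 0.0


def adjacent_similarities(vectors):
    """Similarity for each pair of adjacent vectors (b_n, b_{n+1})."""
    return [cosine_similarity(a, b) for a, b in zip(vectors, vectors[1:])]


if __name__ == "__main__":
    print(adjacent_similarities([[1, 0], [1, 0], [0, 1]]))
```

When every vector element is a non-negative importance value, the inner product cannot be negative, so in practice the similarity falls between 0 and 1 even though the cosine itself ranges from -1 to 1.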

【００２５】次にＣＰＵ１１１は、算出した類似度ｓｎ
ｎ＋１からトピックの変わり目であると予想される区切
れＸｎを仮判定する。すなわち、各類似度ｓｎｎ＋１と
所定の閾値Ｔ１とを比較し、閾値Ｔ１以下の類似度ｓｎ
ｎ＋に対応する区切りＸｎをトピックの区切れと仮判
定する（ステップ１７）。ここで、ＣＰＵ１１１は、類
似度ｓが閾値Ｔ１以下の区切れＸＣが有るか否かを判断
し（ステップ１８）、ない場合には（ステップ１８；
Ｎ）、文書Ａ全体をサブ文書としてステップ２２に移行
する。一方、類似度ｓが閾値Ｔ１以下の区切れＸｎが有
る場合（ステップ１７；Ｙ）、その区切れＸｎまでのサ
ブ文書群（サブ文書Ａ１からＡｎまで）と、区切れ以降
のサブ文書群（サブ文書Ａｎ＋１からＡＰまで）の文書
ベクトルＢ〜ｎ、Ｂｎ＋１〜を、図６（Ｆ）に示すよう
に、図９に従って説明した文書ベクトル作成処理により
求める（ステップ１９）。なお、類似度ｓから求まる区
切れが複数（ｍ個）ある場合には、各区切れ単位の各サ
ブ文書群ｍ＋１個に対して文書ベクトルを作成するが、
本実施形態では、説明を簡単にするため区切れは１つで
あった場合を例に説明する。Next, the CPU 111 provisionally determines, from the calculated similarities s_{n,n+1}, the breaks Xn that are expected to be topic changes. That is, it compares each similarity s_{n,n+1} with a predetermined threshold T1 and provisionally judges each break Xn whose similarity s_{n,n+1} is at or below T1 to be a topic break (step 17). The CPU 111 then determines whether there is any break Xn whose similarity s is at or below the threshold T1 (step 18). If there is none (step 18; N), it treats the entire document A as a single sub-document and proceeds to step 22. If there is such a break Xn (step 18; Y), it obtains document vectors B~n and Bn+1~ for the sub-document group up to the break Xn (sub-documents A1 to An) and the sub-document group after the break (sub-documents An+1 to AP), as shown in FIG. 6(F), using the document vector creation process described with reference to FIG. 9 (step 19). When the similarities yield multiple (m) breaks, a document vector is created for each of the m+1 sub-document groups delimited by the breaks; for simplicity, the present embodiment is described for the case of a single break.
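The provisional determination of steps 17 to 19 can be sketched as a threshold test followed by grouping. The helper names `provisional_breaks` and `group_subdocuments` are hypothetical, and breaks are represented here by the 1-based index n of the break Xn between An and An+1.

```python
def provisional_breaks(similarities, t1):
    """Return the indices n (1-based) of breaks Xn with s_{n,n+1} <= T1."""
    return [n for n, s in enumerate(similarities, start=1) if s <= t1]


def group_subdocuments(sub_documents, breaks):
    """Group sub-documents into the m + 1 runs delimited by m breaks."""
    bounds = [0] + list(breaks) + [len(sub_documents)]
    return [sub_documents[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    sims = [0.9, 0.2, 0.8]          # s_{1,2}, s_{2,3}, s_{3,4}
    breaks = provisional_breaks(sims, t1=0.3)
    print(breaks)
    print(group_subdocuments(["A1", "A2", "A3", "A4"], breaks))
```

With no break at or below the threshold, the grouping returns a single group containing every sub-document, which corresponds to the step 18; N branch.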

【００２６】次にＣＰＵ１１１は、ステップ１６と同様
に、前記した数式１に従って隣接するサブ文書群間の類
似度Ｓを算出する（ステップ２０、図７（Ｇ））。そし
て、類似度Ｓが所定の閾値Ｔ２よりも大きい場合、ステ
ップ１７で仮判定した区切れＸｎは細かなサブ文書に分
割したためにたまたま隣接するサブ文書Ａｎ、Ａｎ＋１
の両文書ベクトルｂｎとｂｎ＋ｔとが離れたものと判断
できるので、区切れＸｎはトピックの変わり目ではない
と判断する。一方、類似度Ｓが所定の閾値Ｔ２以下であ
れば、サブ文書群（Ａ１〜Ａｎ）とサブ文書群（Ａｎ＋
１〜ＡＰ）は異なる内容について記載されており互いに
似ていないと判断できるので、区切れＸｎはトピックの
変わり目であると最終判定し、ＲＡＭ１１３の区切れ位
置格納エリア１１３３に格納する（ステップ２１）。Next, as in step 16, the CPU 111 calculates the similarity S between the adjacent sub-document groups according to Equation 1 (step 20, FIG. 7(G)). If the similarity S is greater than a predetermined threshold T2, the document vectors bn and bn+1 of the adjacent sub-documents An and An+1 can be judged to have diverged merely by chance, as a result of dividing the document into small sub-documents, so the break Xn provisionally determined in step 17 is judged not to be a topic change. If the similarity S is at or below the threshold T2, the sub-document group (A1 to An) and the sub-document group (An+1 to AP) can be judged to describe different content and to be dissimilar, so the break Xn is finally determined to be a topic change and is stored in the break position storage area 1133 of the RAM 113 (step 21).
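The final determination of steps 20 and 21 might look like the following sketch for a single provisional break. One simplification is assumed: the group vectors are approximated by summing the sub-document vectors element by element, whereas the embodiment recomputes them with the document vector creation process of FIG. 9. The function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Equation 1: (u . v) / (|u| * |v|); 0.0 for a zero vector (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def confirm_break(sub_vectors, n, t2):
    """Return True if break Xn (1-based n, 1 <= n < P) is a topic change.

    Assumption: each group vector is the element-wise sum of the vectors
    of the sub-documents in that group.
    """
    before = [sum(col) for col in zip(*sub_vectors[:n])]   # A1 .. An
    after = [sum(col) for col in zip(*sub_vectors[n:])]    # An+1 .. AP
    return cosine_similarity(before, after) <= t2


if __name__ == "__main__":
    # Adjacent sub-documents 2 and 3 differ, but the two groups are alike.
    print(confirm_break([[2, 2], [3, 0], [0, 3], [2, 2]], n=2, t2=0.5))
    # Here the two groups really are about different keywords.
    print(confirm_break([[3, 0], [2, 1], [0, 3], [1, 2]], n=2, t2=0.5))
```

The first call shows the case the second threshold is meant to filter out: a local dip in similarity that disappears once the surrounding sub-documents are taken into account.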

【００２７】次にＣＰＵ１１１は、図７（Ｈ）に示すよ
うに、異なるトピックを含む各サブ文書群（Ａ１〜Ａ
ｎ、Ａｎ＋１〜ＡＰ）毎にサブ要約を作成する（ステッ
プ２２）。図１０は、要約作成処理の動作を表したフロ
ーチャートである。ＣＰＵ１１１は、まず形態素解析を
行うことでサブ文書群に含まれる自立語を抽出する（ス
テップ２２１）と共に、名詞句、複合名詞句等を含めた
候補語（句）を要約対象文書Ａから抽出しＲＡＭ１１３
の所定作業領域に格納する（ステップ２２２）。そし
て、ＲＡＭ１６の要約パラメータ格納エリア１１３２に
格納した要約パラメータや、抽出した候補語（句）のサ
ブ文書群中での出現頻度、評価関数等から、各候補語
（句）重要度ｆ（ｙ）を決定する（ステップ２２３）。
ここで、評価関数としては、例えば、所定の重要語が予
め指定されている場合にはその重要語に対する重み付
け、単語、名詞句、複合名詞句等の候補語（句）の種類
による重み付け等が使用される。Next, as shown in FIG. 7(H), the CPU 111 creates a sub-summary for each sub-document group covering a different topic (A1 to An, An+1 to AP) (step 22). FIG. 10 is a flowchart of the summary creation process. The CPU 111 first performs morphological analysis to extract the independent words contained in the sub-document group (step 221), extracts candidate words (phrases), including noun phrases and compound noun phrases, from the summary target document A, and stores them in a predetermined work area of the RAM 113 (step 222). It then determines the importance f(y) of each candidate word (phrase) from the summary parameters stored in the summary parameter storage area 1132 of the RAM 113, the frequency of occurrence of the extracted candidate words (phrases) in the sub-document group, an evaluation function, and the like (step 223). The evaluation function applies, for example, a weight for predetermined important words when such words have been designated in advance, and a weight according to the type of candidate word (phrase), such as word, noun phrase, or compound noun phrase.

【００２８】さらにＣＰＵ１１１は、決定した重要度ｆ
（ｙ）や要約パラメータ格納エリアリレーに格納された
要約パラメータ等から、サブ文書群含まれる各センテン
スに対する重要度Ｆ（ｚ）を決定する（ステップ２２
４）。そして、決定したセンテンスの重要度Ｆ（ｚ）の
重要度が高いセンテンスの上位から要約パラメータの要
約比率（例えば、サブ文書群の全センテンス数の内の上
位２５％）以内に入るセンテンスをリストアップする
（ステップ２２５）。そしてＣＰＵ１１１は、リストア
ップしたセンテンスをサブ文書群の中での出現順に並べ
ることで当該サブ文書群についてのサブ要約とし、これ
をＲＡＭ１１３の要約格納エリアに格納して（ステップ
２２６）、図３の自動要約処理ルーチンにリターンす
る。Further, the CPU 111 determines the importance F(z) of each sentence contained in the sub-document group from the importance values f(y), the summary parameters stored in the summary parameter storage area 1132, and the like (step 224). It then lists, in descending order of importance F(z), the sentences that fall within the summary ratio given by the summary parameters (for example, the top 25% of all sentences in the sub-document group) (step 225). The CPU 111 arranges the listed sentences in their order of appearance in the sub-document group to form the sub-summary for that group, stores it in the summary storage area of the RAM 113 (step 226), and returns to the automatic summarization routine of FIG. 3.
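Steps 223 to 226 amount to extractive summarization: score the sentences, keep the top share, and restore their original order. The sketch below assumes that the sentence importance F(z) is the sum of the word importances f(y) and that f(y) is plain word frequency; the patent leaves both open, and the name `sub_summary` is hypothetical.

```python
import math
import re
from collections import Counter


def sub_summary(sentences, ratio=0.25):
    """Keep the top `ratio` of sentences by importance, in order of appearance."""
    # Assumption: a simple tokenizer replaces morphological analysis.
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    # f(y): word importance, here just frequency within the sub-document group.
    freq = Counter(tok for toks in tokenized for tok in toks)
    # F(z): sentence importance, assumed to be the sum of its word importances.
    scores = [sum(freq[tok] for tok in toks) for toks in tokenized]

    keep = max(1, math.floor(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:keep]
    return [sentences[i] for i in sorted(ranked)]  # back to order of appearance


if __name__ == "__main__":
    group = [
        "cats chase mice",
        "dogs bark",
        "cats and dogs chase cats",
        "birds sing",
    ]
    print(sub_summary(group, ratio=0.5))
```

Summing word importances favors long sentences; a length-normalized score would be an equally plausible choice under the same description.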

【００２９】各サブ文書群に対するサブ要約の作成が終
了するとＣＰＵ１１１は、図８（Ｉ）に示すように、要
約格納エリアに格納した全てのサブ要約を合成すること
で要約対象文書Ａについての要約とし、要約格納エリア
１１３６の所定エリアに格納して（ステップ２３）、本
実施形態による自動要約処理を終了する。以上説明した
ように、本実施形態による自動要約処理によれば、１文
書中に複数のトピックを含むか否かを判定し、各トピッ
ク毎のサブ要約を合成して要約を作成しているので、各
トピックの内容を的確に把握することが可能な要約を作
成することができる。When sub-summaries have been created for all sub-document groups, the CPU 111 combines all the sub-summaries stored in the summary storage area, as shown in FIG. 8(I), to form the summary of the summary target document A, stores it in a predetermined area of the summary storage area 1136 (step 23), and ends the automatic summarization process of the present embodiment. As described above, the automatic summarization process of the present embodiment determines whether a single document contains multiple topics and creates the summary by combining sub-summaries for the individual topics, so it can produce a summary from which the content of each topic can be accurately grasped.

【００３０】以上の自動要約処理が終了すると、ＣＰＵ
１１１はユーザの指示によりＲＡＭ１１３に格納した各
データの保存処理を行う。すなわち、要約対象文書格納
エリア１１３１から要約対象文書Ａを読み出して、記憶
装置１６の文書データベース１６４に格納する。また作
成した要約を要約格納エリア１１３５から読み出し、文
書データベース１６４に格納した要約対象文書Ａとの関
連性を付けて記憶装置１６の要約データベース１６５に
格納する。さらに、文書ベクトル作成処理（図３のステ
ップ１３、図９）で求めた文書ベクトルＶを文書ベクト
ル格納エリア１１３５から読み出し、文書データベース
１６４に格納した要約対象文書Ａとの関連性を付けて記
憶装置１６の文書ベクトルデータベース１６６に格納す
る。When the automatic summarization process described above ends, the CPU 111 saves the data stored in the RAM 113 in accordance with the user's instruction. That is, it reads the summary target document A from the summary target document storage area 1131 and stores it in the document database 164 of the storage device 16. It also reads the created summary from the summary storage area 1135 and stores it in the summary database 165 of the storage device 16, associated with the summary target document A stored in the document database 164. Further, it reads the document vector V obtained in the document vector creation process (step 13 in FIG. 3; FIG. 9) from the document vector storage area 1135 and stores it in the document vector database 166 of the storage device 16, associated with the summary target document A stored in the document database 164.

【００３１】以上、本実施形態の構成および自動要約処
理について説明したが、本発明では、これらの各形態に
限定されるものではなく、各請求項に記載された発明の
範囲内で種々の変形をすることが可能である。例えば実
施形態では、形態素解析及び候補語（句）の抽出につい
て、文書ベクトル作成処理（図９のステップ１３１とス
テップ１３２）と、要約作成処理（図１０のステップ２
２１とステップ２２２）とにおいて独立して同様な処理
を行うこととしたが、本発明では、文書ベクトル作成処
理で抽出した候補語（句）をＲＡＭ１６の所定エリアに
格納しておき、要約作成処理で利用するようにしてもよ
い。The configuration and automatic summarization process of the present embodiment have been described above, but the present invention is not limited to these forms, and various modifications are possible within the scope of the invention described in each claim. For example, in the embodiment, morphological analysis and candidate word (phrase) extraction are performed independently, in the same way, in both the document vector creation process (steps 131 and 132 in FIG. 9) and the summary creation process (steps 221 and 222 in FIG. 10). In the present invention, however, the candidate words (phrases) extracted in the document vector creation process may be stored in a predetermined area of the RAM 113 and reused in the summary creation process.

【００３２】また説明した実施形態では、自動要約処理
が終了した後の保存処理において、要約対象文書Ａ、要
約、文書ベクトルＶのみを記憶装置１６の各データベー
ス１６４、１６５、１６６に格納し保存するようにした
が、本発明では更に、文書ベクトル作成処理（図９）の
ステップ１３２で要約対象文書Ａから抽出し、ＲＡＭ１
１３の所定作業領域に格納した候補語（句）を要約対象
文書Ａと関連つけて、文書データベース１６４、又は専
用の候補語（句）データベースに格納するようにしても
よい。また要約パラメータ格納エリア１１３２から要約
パラメータを読み出して、当該要約に関連付けて、要約
データベース１６６、または専用の要約パラメータデー
タベースに格納するようにしてもよい。また、ステップ
２０（図３）において最終的にトピックの変わり目であ
ると判定した区切れＸｎを区切れ位置格納エリア１１３
３から読み出し、要約対象文書Ａと関連つけて、文書デ
ータベース１６４、又は専用のトピック区切れデータベ
ースに格納するようにしてもよい。In the embodiment described above, only the summary target document A, the summary, and the document vector V are stored in the databases 164, 165, and 166 of the storage device 16 in the save process after automatic summarization ends. In the present invention, the candidate words (phrases) extracted from the summary target document A in step 132 of the document vector creation process (FIG. 9) and stored in the predetermined work area of the RAM 113 may additionally be stored, associated with the summary target document A, in the document database 164 or in a dedicated candidate word (phrase) database. The summary parameters may also be read from the summary parameter storage area 1132 and stored, associated with the summary, in the summary database 165 or in a dedicated summary parameter database. Likewise, the breaks Xn finally determined to be topic changes in step 21 (FIG. 3) may be read from the break position storage area 1133 and stored, associated with the summary target document A, in the document database 164 or in a dedicated topic break database.

【００３３】さらに、説明した実施形態では、文書ベク
トル作成処理（ステップ１３、図９）及び要約作成処理
（ステップ２２、図１０）の両処理において、形態素解
析（ステップ１３１、２２１）と候補語（句）の抽出
（ステップ１３２、２２２）を行った。しかし、同一セ
ンテンスに対する処理であるため、抽出した候補語
（句）は同一である。そこで、本発明では、文書ベクト
ル作成処理で抽出した候補語（句）をＲＡＭ１１３の所
定エリアに格納しておき、要約処理において格納した候
補語（句）を使用することでステップ２２１とステップ
２２２を省略するようにしてもよい。この候補語（句）
についても、要約対象文書Ａに対する候補語（句）とし
て文書データベース１６４、又は専用の候補語（句）デ
ータベースに格納するようにしてもよい。Further, in the embodiment described above, morphological analysis (steps 131 and 221) and candidate word (phrase) extraction (steps 132 and 222) are performed in both the document vector creation process (step 13, FIG. 9) and the summary creation process (step 22, FIG. 10). Because both processes operate on the same sentences, however, the extracted candidate words (phrases) are identical. In the present invention, therefore, the candidate words (phrases) extracted in the document vector creation process may be stored in a predetermined area of the RAM 113 and used in the summary creation process, so that steps 221 and 222 can be omitted. These candidate words (phrases) may also be stored in the document database 164 or in a dedicated candidate word (phrase) database as the candidate words (phrases) for the summary target document A.

【００３４】また、説明した実施形態ではトピックの変
わり目を判定する閾値Ｔ１、Ｔ２として予め決められた
固定値を使用するようにしたが、本発明では閾値の値を
ユーザが変更することができるようにしてもよい。ま
た、予想トピック数ｕ（固定値の閾値関数や過去の類似
文書における履歴から算出）をパラメータに取り入れた
閾値関数Ｔ１（ｕ）、Ｔ２（ｕ）を使用するようにして
もよい。In the embodiment described above, predetermined fixed values are used as the thresholds T1 and T2 for determining topic changes, but in the present invention the user may be allowed to change the threshold values. Alternatively, threshold functions T1(u) and T2(u) may be used that take as a parameter the expected number of topics u (calculated from a fixed-value threshold function or from the history of similar past documents).

【００３５】また説明した実施形態では、要約対象文書
ＡをＰ個のサブ文書に分割し、トピックの変わり目と予
想される区切れＸｎの仮判定刷を１回だけ行い、句切れ
がない場合（ステップ１８；Ｎ）にはトピックが複数存
在しないと判断して要約対象文書Ａに全体に対する要約
を作成する場合について説明した。しかし、あるサブ文
書Ａｎの中央に実際のトピックの変わり目が存在した場
合、そのサブ文書の文書ベクトルｂｎが中間的な値とな
り、隣接サブ文書ｂｎ−１、ｂｎ＋１との間で有為な差
が出ない、すなわち、隣接する前後のサブ文書との類似
度ｓｎ−１ｎ、ｓｎｎ＋１が閾値Ｔ１以下にならない可
能性がある。そこで、ステップ１８において句切れがな
いと判断された場合（ステップ１８；Ｎ）、サブ文書に
分割するサイズを乱数や、互いに素な数値（例えば、５
に対して１０にするのでなく４か６にするとの意味）で
少し変化させ、複数回リトライして有為な差が生じたも
のを採用するようにしてもよい。In the embodiment described above, the summary target document A is divided into P sub-documents, the provisional determination of breaks Xn expected to be topic changes is performed only once, and if no break is found (step 18; N), the document is judged not to contain multiple topics and a summary is created for the summary target document A as a whole. However, if an actual topic change lies in the middle of a sub-document An, the document vector bn of that sub-document takes an intermediate value, and no significant difference may appear between it and the vectors bn-1 and bn+1 of the adjacent sub-documents; that is, the similarities s_{n-1,n} and s_{n,n+1} with the preceding and following sub-documents may not fall to or below the threshold T1. Therefore, when it is judged in step 18 that there is no break (step 18; N), the sub-document size may be varied slightly, using a random number or a value relatively prime to the previous one (for example, changing a size of 5 to 4 or 6 rather than to 10), and the determination retried several times, adopting the division that produces a significant difference.
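The retry strategy described above can be sketched as follows, using the relatively prime variant rather than random sizes. Everything here is an assumption made for illustration: the helper names, the number of attempts, and the `detect` callable, which stands for a hypothetical function that divides the document with a given sub-document size and returns the provisional breaks it finds.

```python
import math


def retry_sizes(base, attempts=3):
    """Sizes close to `base` that are relatively prime to it (e.g. 5 -> 4, 6, 3)."""
    sizes, delta = [], 1
    while len(sizes) < attempts:
        for size in (base - delta, base + delta):
            if size >= 1 and math.gcd(size, base) == 1 and len(sizes) < attempts:
                sizes.append(size)
        delta += 1
    return sizes


def find_breaks_with_retry(detect, base_size, attempts=3):
    """Try the base size first, then nearby sizes, until some break is found.

    `detect(size)` is a hypothetical callable returning a list of breaks.
    """
    for size in [base_size] + retry_sizes(base_size, attempts):
        breaks = detect(size)
        if breaks:
            return size, breaks
    return base_size, []  # no topic change found with any size


if __name__ == "__main__":
    print(retry_sizes(5))
    # Toy detector that only finds a break when the size is 6.
    print(find_breaks_with_retry(lambda size: [2] if size == 6 else [], 5))
```

A size that is a multiple of the original, such as 10 for 5, would place every new boundary on an old one, so a topic change hidden in the middle of a sub-document would tend to stay hidden; a relatively prime size shifts the boundaries.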

【００３６】説明した実施形態では、要約対象文書Ａに
複数のトピックが含まれてるか否かを判定し、その結果
を要約の作成処理に適用する場合について説明したが、
本発明ではトピックの判定結果を他に適用するようにし
てもよい。例えば、ＷＥＢのSGMLにおいてリンクを張る
場合、判定したトピック単位で特定のポインタを指すよ
うにしてもよい。また、ハイパーリンクの飛び先を判定
したトピック単位とし、ファイングレインドで指定する
ようにしてもよい。In the embodiment described above, whether the summary target document A contains multiple topics is determined and the result is applied to summary creation, but in the present invention the topic determination result may be applied to other purposes. For example, when creating links in SGML on the Web, a specific pointer may be set for each determined topic. The destination of a hyperlink may also be specified at fine granularity, in units of the determined topics.

【００３７】説明した実施形態では文書ベクトルを作成
する方法として図９のフローチャートに従った方法を１
例にして説明したが、本発明でこの方法に限られるもの
ではなく、要約対象文書中Ａからキーワードを抽出する
方法や、抽出キーワードに対する重要度（＝文書ベクト
ルの要素値）の決定方法等については、公知の各種方法
により置き換えることが可能である。また、各サブ文書
群に対する要約の作成処理についても同様に図１０のフ
ローチャートに示した方法に限られるものではなく、公
知の各種要約方法、抄録作成方法等を資料することが可
能である。更に、２つの文書ベクトルの類似度の算出方
法については、数式１により類似度を算出することとし
たが、この数式に限定されるものではなく、ベクトル相
互間の類似関係を表すことが可能であれば他の数式によ
り類似度を算出することも可能である。In the embodiment described above, the method following the flowchart of FIG. 9 was described as one example of creating a document vector, but the present invention is not limited to this method; the method of extracting keywords from the summary target document A, the method of determining the importance of the extracted keywords (that is, the element values of the document vector), and so on can be replaced by various known methods. Likewise, the process of creating a summary for each sub-document group is not limited to the method shown in the flowchart of FIG. 10, and various known summarization and abstracting methods can be used. Further, although the similarity between two document vectors is calculated by Equation 1, the calculation is not limited to this equation; any other formula may be used as long as it can express the similarity relationship between vectors.

【００３８】説明した実施形態は日本語で作成された文
書に限られるものではなく、あらゆる言語で作成された
文書を対象とすることが可能である。その場合、対象と
なる文書が作成された言語用の形態素解析アルゴリズム
等を使用するといった、本発明の構成には影響のない部
分を変更するだけでよい。The embodiment described above is not limited to documents written in Japanese; documents written in any language can be processed. In that case, only parts that do not affect the configuration of the present invention need be changed, such as using a morphological analysis algorithm for the language in which the target document is written.

【００３９】以上の実施形態において説明した、各装
置、各部、各動作、各処理等に対しては、それらを含む
上位概念としての各手段（〜手段）により、実施形態を
構成することが可能である。例えば、「類似度ｓが閾値
Ｔ１以下の区切れＸＣが有るか否かを判断し（ステップ
１８）」との記載に対して「区切れ有無判断手段」を構
成し、「決定した重要度ｆ（ｘ）の値から要約対象文書
Ａのキーワードａ，ｂ，…を決定する（ステップ１３
４）」との記載に対して「キーワード決定手段」を構成
し、「決定したセンテンスの重要度Ｆ（ｚ）の重要度が
高いセンテンスの上位から要約パラメータの要約比率
（例えば、サブ文書群の全センテンス数の内の上位２５
％）以内に入るセンテンスをリストアップする（ステッ
プ２２５）」との記載に対して「センテンスリストアッ
プ手段」を構成するようにしてもよい。同様に、その他
各種動作に対して「〜（動作）手段」等の上位概念で実
施形態を構成するようにしてもよい。For each device, unit, operation, process, and the like described in the above embodiment, the embodiment can be configured with a corresponding means ("... means") as a superordinate concept that includes it. For example, a "break presence determination means" may be configured for the description "determines whether there is any break Xn whose similarity s is at or below the threshold T1 (step 18)", a "keyword determination means" for the description "determines the keywords a, b, ... of the summary target document A from the importance values f(x) (step 134)", and a "sentence listing means" for the description "lists, in descending order of importance F(z), the sentences that fall within the summary ratio given by the summary parameters (for example, the top 25% of all sentences in the sub-document group) (step 225)". Similarly, the embodiment may be configured with superordinate concepts such as "... (operation) means" for the various other operations.

【００４０】また第１変形として、図１１に示すよう
に、複数の文章で構成された所定形式の文書を取得する
文書取得手段１０１と、前記文書取得手段１０１で取得
された文書を複数のサブ文書に分割する文書分割手段１
０２と、前記文書分割手段１０２により分割された各サ
ブ文書について、隣接する２つのサブ文書間の類似度を
算出する類似度算出手段１０３と、前記類似度算出手段
１０３で算出された各サブ文書間の類似度からトピック
の変わり目を調べ、前記文書に複数のトピックが含まれ
るか否かを判定する判定手段１０４と、を文書処理装置
に備えさせて前記第１の目的を達成するようにしてもよ
い。第２変形として、図１２に示すように、第１変形に
記載した文書処理装置において、前記文書分割手段１０
２で分割されたサブ文書を特徴づける文書ベクトルを決
定する文書ベクトル決定手段１０５を備え、前記類似度
算出手段１０３は前記文書ベクトル決定手段１０５で決
定された各サブ文書の文書ベクトルにより隣接する２つ
のサブ文書間の類似度を算出する。このように、隣接す
る２つのサブ文書間での類似度を算出することで、ＣＰ
Ｕ１１１による処理量（計算量）を減らすことだでき、
また、１文書におけるテキストの連続性（連結性）から
もより精度の高い複数トピック検索を行うことができ
る。As a first modification, as shown in FIG. 11, the document processing apparatus may be provided with document acquisition means 101 for acquiring a document in a predetermined format composed of a plurality of sentences, document division means 102 for dividing the document acquired by the document acquisition means 101 into a plurality of sub-documents, similarity calculation means 103 for calculating, for the sub-documents produced by the document division means 102, the similarity between each two adjacent sub-documents, and determination means 104 for examining topic changes from the similarities calculated by the similarity calculation means 103 and determining whether the document contains multiple topics, thereby achieving the first object. As a second modification, as shown in FIG. 12, the document processing apparatus of the first modification further comprises document vector determination means 105 for determining a document vector characterizing each sub-document produced by the document division means 102, and the similarity calculation means 103 calculates the similarity between two adjacent sub-documents from the document vectors determined by the document vector determination means 105. Calculating the similarity only between adjacent sub-documents in this way reduces the amount of processing (computation) performed by the CPU 111 and, because it respects the continuity (connectedness) of the text within a document, also enables a more accurate multiple-topic search.

【００４１】ただし、本発明では、必ずしも隣接する２
つのサブ文書間での類似度を算出する必要はなく、例え
ば、隣接する３つ以上のサブ文書間の類似度を算出する
ようにしてもよい。すなわち、注目のサブ文書と、その
前後のサブ文書とで比較を行い、その結果を（同時に）
勘案して、トピックの変わり目が、１）注目サブ文書の
中か、２）その直後のサブ文書境界か、３）その直前の
サブ文書境界か、のいずれであるかを一時に判定する、
ようにしてもよい。また必ずしも隣接しないサブ文書間
での類似度を判定するようにしてもよい。この場合に
は、段落間の１、２個の改行コード、空白コードを処理
上は、サブ文書とし、この改行等によるサブ文書を飛び
越えたサブ文書間の類似度を算出する場合も含まれる。In the present invention, however, the similarity need not be calculated between two adjacent sub-documents; for example, the similarity among three or more adjacent sub-documents may be calculated. That is, the sub-document of interest may be compared with the sub-documents before and after it, and the results considered together to determine at once whether the topic change lies 1) within the sub-document of interest, 2) at the sub-document boundary immediately after it, or 3) at the sub-document boundary immediately before it. The similarity may also be determined between sub-documents that are not adjacent. This includes the case where one or two line-feed codes or blank codes between paragraphs are treated, for processing purposes, as a sub-document, and the similarity is calculated between the sub-documents on either side, skipping over that line-feed sub-document.

【００４２】[0042]

【発明の効果】本発明によれば、複数の文章で構成され
た所定形式の文書を取得し、取得した文書を複数のサブ
文書に分割し、分割した各サブ文書について、隣接する
２つのサブ文書間の類似度を算出し、算出した各サブ文
書間の類似度からトピックの変わり目を調べ、文書に複
数のトピックが含まれるか否かを判定するようにしたの
で、自動的に複数のトピックが含まれているか否かを判
断することができる。従って、各トピック毎の要約を作
成したり、各トピック毎に他の文書やデータ間での関連
付けを行うことができる。According to the present invention, a document in a predetermined format composed of a plurality of sentences is acquired, the acquired document is divided into a plurality of sub-documents, the similarity between each two adjacent sub-documents is calculated, topic changes are examined from the calculated similarities, and it is determined whether the document contains multiple topics. Whether a document contains multiple topics can therefore be determined automatically. Consequently, a summary can be created for each topic, and each topic can be associated with other documents or data.

[Brief description of the drawings]

【図１】本発明の１実施形態における文書処理装置の構
成を表したブロック図である。FIG. 1 is a block diagram illustrating a configuration of a document processing apparatus according to an embodiment of the present invention.

【図２】同上、実施形態における文書ベクトルデータベ
ースの内容を概念的に表した説明図である。FIG. 2 is an explanatory diagram conceptually showing the contents of a document vector database in the embodiment.

【図３】同上、実施形態における自動要約処理のメイン
動作を表したフローチャートである。FIG. 3 is a flowchart showing a main operation of an automatic summarization process in the embodiment.

FIG. 4 is a part of an explanatory diagram conceptually showing the processing corresponding to each step of the automatic summarization process shown in FIG. 3 in the embodiment.

FIG. 5 is another part of the explanatory diagram conceptually showing the processing corresponding to each step of the automatic summarization process shown in FIG. 3 in the embodiment.

FIG. 6 is another part of the explanatory diagram conceptually showing the processing corresponding to each step of the automatic summarization process shown in FIG. 3 in the embodiment.

FIG. 7 is another part of the explanatory diagram conceptually showing the processing corresponding to each step of the automatic summarization process shown in FIG. 3 in the embodiment.

FIG. 8 is another part of the explanatory diagram conceptually showing the processing corresponding to each step of the automatic summarization process shown in FIG. 3 in the embodiment.

FIG. 9 is a flowchart showing the operation of the document vector creation process in the embodiment.

FIG. 10 is a flowchart showing the operation of the summary creation process in the embodiment.

FIG. 11 is a claim correspondence diagram for the invention described in claim 1.

FIG. 12 is a claim correspondence diagram for the invention described in claim 2.

FIG. 13 is an example of a claim correspondence diagram for the invention described in claim 4.

FIG. 14 is an example of a claim correspondence diagram for the invention described in claim 5.

FIG. 15 is a claim correspondence diagram for the invention described in claim 7.

FIG. 16 is a claim correspondence diagram for the invention described in claim 8.

FIG. 17 is an example of a claim correspondence diagram for the invention described in claim 10.

FIG. 18 is an example of a claim correspondence diagram for the invention described in claim 11.

FIG. 19 is a claim correspondence diagram for the invention described in claim 13.

FIG. 20 is a claim correspondence diagram for the invention described in claim 14.

FIG. 21 is a claim correspondence diagram for the invention described in claim 15.

[Explanation of symbols]

11 Control unit
112 ROM
113 RAM
1131 Summarization target document storage area
1132 Summarization parameter storage area
1133 Break position storage area
1134 Document vector storage area
1135 Summary storage area
12 Keyboard
13 Mouse
14 Display device
15 Printing device
16 Storage device
161 Kana-kanji conversion dictionary
162 Program storage unit
163 Data storage unit
164 Document database
165 Summary database
166 Document vector database
17 Storage medium drive device
18 Communication control device
19 Input/output I/F
101 Document acquisition means
102 Document division means
103 Similarity calculation means
104 Determination means
105 Document vector determination means
106 Summary creation means
107 Association means
201 Document acquisition function
202 Document division function
203 Similarity calculation function
204 Determination function
205 Document vector determination function
206 Summary creation function
207 Association function

Claims

[Claims]

1. A document processing apparatus comprising: document acquisition means for acquiring a document of a predetermined format composed of a plurality of sentences; document division means for dividing the document acquired by the document acquisition means into a plurality of sub-documents; similarity calculation means for calculating the similarity between the sub-documents divided by the document division means; and determination means for determining, from the similarity between the sub-documents calculated by the similarity calculation means, whether the document includes a plurality of topics.

2. The document processing apparatus according to claim 1, further comprising document vector determination means for determining a document vector characterizing each sub-document divided by the document division means, wherein the similarity calculation means calculates the similarity between the sub-documents from the document vectors of the sub-documents determined by the document vector determination means.

3. The document processing apparatus according to claim 1, wherein the determination means tentatively determines topic change points from the similarity between the sub-documents calculated by the similarity calculation means, the similarity calculation means further calculates the similarity between the sub-document groups obtained by re-dividing the document at the topic change points tentatively determined by the determination means, and the determination means determines whether the document includes a plurality of topics from the similarity between the sub-document groups calculated by the similarity calculation means.

4. The document processing apparatus according to claim 1, further comprising summary creation means for automatically creating a summary of a document composed of a plurality of documents, wherein, when the determination means determines that the document includes a plurality of topics, the summary creation means creates a summary for each unit constituting a topic.

5. The document processing apparatus according to any one of claims 1 to 4, further comprising association means for associating predetermined data with other data, wherein the association means performs the association with other data in units constituting the topics determined by the determination means.

6. The document processing apparatus according to any one of claims 1 to 5, wherein, when the determination means determines that a plurality of topics are not included, the document division means re-divides the document into sub-documents of a different size, the similarity calculation means recalculates the similarity between adjacent sub-documents after the re-division, and the determination means examines topic changes from the recalculated similarity and re-determines whether the document includes a plurality of topics.

7. A storage medium storing a computer-readable document processing program for causing a computer to implement: a document acquisition function for acquiring a document of a predetermined format composed of a plurality of sentences; a document division function for dividing the document acquired by the document acquisition function into a plurality of sub-documents; a similarity calculation function for calculating the similarity between the sub-documents divided by the document division function; and a determination function for determining, from the similarity between the sub-documents calculated by the similarity calculation function, whether the document includes a plurality of topics.

8. The storage medium storing the document processing program according to claim 7, wherein the program further causes the computer to implement a document vector determination function for determining a document vector characterizing each sub-document divided by the document division function, and the similarity calculation function calculates the similarity between the sub-documents from the document vectors of the sub-documents determined by the document vector determination function.

9. The storage medium storing the document processing program according to claim 7, wherein the determination function tentatively determines topic change points from the similarity between the sub-documents calculated by the similarity calculation function, the similarity calculation function further calculates the similarity between the sub-document groups obtained by re-dividing the document at the topic change points tentatively determined by the determination function, and the determination function determines whether the document includes a plurality of topics from the similarity between the sub-document groups calculated by the similarity calculation function.

10. The storage medium storing the document processing program according to claim 7, wherein the program further causes the computer to implement a summary creation function for automatically creating a summary of a document composed of a plurality of documents, and, when the determination function determines that the document includes a plurality of topics, the summary creation function creates a summary for each unit constituting a topic.

11. The storage medium storing the document processing program according to any one of claims 7 to 10, wherein the program further causes the computer to implement an association function for associating predetermined data with other data, and the association function performs the association with other data in units constituting the topics determined by the determination function.

12. The storage medium storing the document processing program according to any one of claims 7 to 11, wherein, when the determination function determines that a plurality of topics are not included, the document division function re-divides the document into sub-documents of a different size, the similarity calculation function recalculates the similarity between adjacent sub-documents after the re-division, and the determination function examines topic changes from the recalculated similarity and re-determines whether the document includes a plurality of topics.

13. A document processing method comprising: acquiring a document of a predetermined format composed of a plurality of sentences; dividing the acquired document into a plurality of sub-documents; calculating the similarity between the divided sub-documents; and determining, from the calculated similarity, whether the document includes a plurality of topics.

14. The document processing method according to claim 13, wherein a document vector characterizing each divided sub-document is determined, and the similarity between adjacent sub-documents is calculated from the determined document vectors of the sub-documents.

15. The document processing method according to claim 13, wherein, when it is determined that the document includes a plurality of topics, a summary is created for each unit constituting a topic.
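
The two-stage determination recited in claims 3 and 9 and the re-division recited in claims 6 and 12 can be illustrated with the following sketch. It is an interpretation under stated assumptions, not the claimed implementation. The word-count vectors, the single `threshold` used for both stages, the candidate sub-document sizes in `sizes`, and all function names are hypothetical.

```python
import math
from collections import Counter


def vector(sentences):
    """Hypothetical document vector: lowercase word counts over the given sentences."""
    return Counter(w for s in sentences for w in s.lower().split())


def cosine(u, v):
    """Cosine between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def low_similarity_breaks(units, threshold):
    """Indices i at which units[i - 1] and units[i] fall below the assumed threshold."""
    vecs = [vector(u) for u in units]
    return [i for i in range(1, len(units)) if cosine(vecs[i - 1], vecs[i]) < threshold]


def find_topic_groups(sentences, size, threshold):
    """Stage 1: tentative breaks between sub-documents. Stage 2: confirm them between groups."""
    subdocs = [sentences[i:i + size] for i in range(0, len(sentences), size)]
    tentative = low_similarity_breaks(subdocs, threshold)
    # Re-divide at the tentative breaks; each group is the flattened run of sub-documents.
    bounds = [0] + tentative + [len(subdocs)]
    groups = [[s for d in subdocs[a:b] for s in d] for a, b in zip(bounds, bounds[1:])]
    confirmed = set(low_similarity_breaks(groups, threshold))
    # Merge groups whose tentative break was not confirmed at the group level.
    merged = [list(groups[0])]
    for i in range(1, len(groups)):
        if i in confirmed:
            merged.append(list(groups[i]))
        else:
            merged[-1] = merged[-1] + groups[i]
    return merged


def find_topics_with_redivision(sentences, sizes=(3, 2, 1), threshold=0.1):
    """Retry with a different sub-document size while only one topic is found."""
    for size in sizes:
        topics = find_topic_groups(sentences, size, threshold)
        if len(topics) > 1:
            return topics
    return [list(sentences)]
```

In this sketch a break found between individual sub-documents is only kept if the larger groups on either side of it are also dissimilar, which guards against a single atypical sub-document being mistaken for a change of topic. The retry loop mirrors the re-division of claims 6 and 12: a coarse division that hides a topic boundary inside one sub-document is replaced by another size before the document is declared single-topic.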