JPH10320409A

JPH10320409A - Method and device for extracting document information and storage medium storing document extracting process program

Info

Publication number: JPH10320409A
Application number: JP9128986A
Authority: JP
Inventors: Shinji Miwa; 真司三輪
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1997-05-19
Filing date: 1997-05-19
Publication date: 1998-12-04

Abstract

PROBLEM TO BE SOLVED: To extract document contents as contextually coherent groups, because extracting part of a document or finding the differences between two documents becomes difficult when the document is divided too finely. SOLUTION: Paragraphs are detected in the contents of a document, the contents are divided into those paragraphs, and morphological analysis is performed on each paragraph. Feature elements are then extracted from the morphological analysis results (step s1), and a feature table is generated that shows the relationship between each feature element and the paragraphs containing it (step s2). Based on this feature table, the document is classified into contents, each a meaningful group (step s3). When a content selection instruction is received from the user (step s4), the document contents of the paragraphs belonging to the selected content are output (steps s5 and s6).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] Technical Field of the Invention

The present invention relates to a document information extraction method and apparatus suitable for processing such as extracting part of the contents of a document or taking the difference between documents, and to a storage medium storing a document information extraction processing program.

[0002] Description of the Related Art

When, for example, taking the difference between the contents of two documents or extracting part of the contents of a document, it is necessary to efficiently extract the document contents requested by the user, in units of suitable size, from a database or the like in which document data is accumulated.

[0003] To extract only a specific portion in this way, it is necessary to divide the document data according to predetermined rules, extract a portion, and display the information of the extracted portion.

[0004] On the Internet, when extracting part of the contents of a document, a document written in a document structure description language such as HTML (HyperText Markup Language) could be divided using the delimiters of that language, while a document written in natural language could be divided at line-initial indentation, blank lines, and the like. If information on what to use as the unit of division is stored in the system in advance, a document can be divided according to that stored information.

[0005] Problems to Be Solved by the Invention

However, if various division methods are stored so as to handle various documents, and are applied to all documents to be processed regardless of their type or content, then for some documents the divided pieces may become too fine, or the document may be divided at places where it would be better not to divide.

[0006] In particular, when taking the difference between an updated portion of online information and the pre-update information, if the information is extracted in pieces that are too fine, or divided where it should not be, the context of the content may be lost and its meaning may become impossible to understand.

[0007] Accordingly, the present invention detects, from a document, the division units obtained by dividing at delimiters or formatting (referred to in this text as paragraphs), extracts for each paragraph feature elements that represent the characteristics of its content, and, based on those feature elements, classifies the paragraphs into groups of semantically similar paragraphs (referred to in this text as contents) and outputs them. It also classifies the document into a plurality of contents for display, so that when the user designates a content, the document contents belonging to that content can be output. Such a content can in turn be reclassified internally. The object of the invention is thereby to provide a document information extraction method and a document information extraction apparatus that are highly convenient for processing such as taking the difference between two documents or extracting part of a document.

[0008] Means for Solving the Problems

The invention of claim 1, in the information retrieval method of the present invention, is characterized in that paragraphs are detected from the contents of a document, feature elements representing the characteristics of the content are extracted for each paragraph, a feature table representing the relationship between each feature element and the paragraphs containing it is created, and, based on the feature table, the document is classified into a plurality of contents, one per semantic unit, and output.

[0009] In the invention of claim 2, according to the invention of claim 1, the process of classifying the document into contents by semantic unit and outputting them classifies the document, based on the feature table, into a plurality of contents by semantic unit and displays them, and, when a content selection instruction is received from the user, outputs the material belonging to the selected content.

[0010] In the invention of claim 3, according to the invention of claim 1 or 2, the process of classifying the document into a plurality of contents by semantic unit based on the feature table groups together, based on the feature elements present in each paragraph, a plurality of paragraphs having a common feature element, and treats that group as a content.

[0011] In the invention of claim 4, according to the invention of claim 2, what is displayed when the document is classified into contents by semantic unit is, for each content, at least the feature element representing that content.

[0012] In the invention of claim 5, according to the invention of claim 2, what is displayed when the document is classified into contents by semantic unit is, for each content, at least the feature element representing that content and data on the paragraphs included in that content.

[0013] The invention of claim 6, in the document information extraction apparatus of the present invention, is characterized by comprising: a document group storage unit that stores a document group; a sentence analysis unit that, for a document stored in the document group storage unit, detects paragraphs from the document contents and performs morphological analysis on each paragraph; a feature table creation unit that extracts, from the analysis results of the sentence analysis unit, feature elements representing the characteristics of the content of each paragraph, and creates a feature table representing the relationship between each feature element and the paragraphs containing it; a paragraph classification unit that classifies the document into a plurality of contents by semantic unit based on the contents of the feature table; a classification result storage unit that stores the results classified by the paragraph classification unit; and an output control unit that reads and outputs the contents of the classification result storage unit.

[0014] The apparatus is further characterized by comprising a display control unit that reads the contents of this classification result storage unit and performs control to display the plurality of contents, and that, when a content selection instruction is received from the user, displays the contents of the division units belonging to the selected content.

[0015] In the invention of claim 7, according to the invention of claim 6, the output control unit performs control to display the plurality of contents and, when a content selection instruction is received from the user, outputs the contents of the paragraphs belonging to the selected content.

[0016] In the invention of claim 8, according to the invention of claim 6 or 7, the process, performed by the paragraph classification unit, of classifying the document into contents by semantic unit based on the feature table groups together, based on the feature elements of each paragraph, paragraphs having a common feature element, and treats that group as a content.

[0017] In the invention of claim 9, according to the invention of claim 7, what is displayed when the document is classified into contents by semantic unit is, for each content, at least the feature element representing that content.

[0018] In the invention of claim 10, according to the invention of claim 7, what is displayed when the document is classified into contents by semantic unit is, for each content, at least the feature element representing that content and data on the paragraphs included in that content.

[0019] Further, the invention of claim 11, in the storage medium storing the document information extraction processing program of the present invention, is a storage medium on which is recorded a processing program for performing document extraction processing by computer, characterized in that the processing program detects paragraphs from the contents of a document, extracts for each paragraph feature elements representing the characteristics of its content, creates a feature table representing the relationship between each feature element and the paragraphs containing it, and, based on the feature table, classifies the document into a plurality of contents by semantic unit and outputs them.

[0020] In the invention of claim 12, the process in the storage medium of claim 11 of classifying the document into a plurality of contents by semantic unit and outputting them is a process that, based on the feature table, classifies the document into a plurality of contents by semantic unit and displays them and, when a content selection instruction is received from the user, outputs the contents of the paragraphs belonging to the selected content.

[0021] In the invention of claim 13, the process in the storage medium of claim 12 of classifying the document into a plurality of contents by semantic unit based on the feature table is a process that groups together, based on the feature elements of each paragraph, a plurality of paragraphs having a common feature element, and treats that group as a content.

[0022] In the invention of claim 14, what is displayed by the display processing of the storage medium of claim 12 (that is, what is displayed when the document is classified into contents by semantic unit) is, for each content, at least the feature element representing that content.

[0023] In the invention of claim 15, what is displayed by the display processing of the storage medium of claim 12 (that is, what is displayed when the document is classified into contents by semantic unit) is, for each content, at least the feature element representing that content and data on the paragraphs included in that content.

[0024] The present invention extracts paragraphs from a document, extracts for each paragraph feature elements representing the characteristics of its content, classifies the document into contents by semantic unit, and outputs them. It also classifies the document into a plurality of contents by semantic unit for display and, when a content selection instruction is received from the user, outputs the material belonging to the selected content.

[0025] Because the present invention thus gathers paragraphs into contents, which are coherent units of content, and outputs them, paragraphs that were divided too finely by delimiters or formatting can be recombined according to their content. The way paragraphs are divided can therefore be set without regard to the content or type of the document. In addition, because paragraph detection is relatively easy, the division can be performed efficiently.

[0026] The process of classifying the document into contents by semantic unit classifies paragraphs having a common feature element as a content, based on the feature elements of each paragraph.

[0027] As a result, even when the unit of division is as small as a paragraph, the paragraphs are classified into contents made up of several semantically coherent paragraphs, so the document being processed is not divided too finely.

[0028] The document is also classified into a plurality of contents for display, and when the user designates a content, the document contents belonging to it can be output. What is displayed for each content is at least the feature element representing that content, or that feature element together with data on the paragraphs included in the content, which makes each content easy to grasp on screen and the desired content efficient to select. Furthermore, when a content selection instruction is received from the user, outputting the paragraphs belonging to the selected content removes the parts that are unnecessary for later processing or for the user's browsing and narrows the material down efficiently to the useful information.

[0029] Thus, for example, when some processing is performed in units of contents (such as examining the difference between an updated portion of online information and the pre-update information), the display is in units of contents made up of several semantically coherent paragraphs, which is highly convenient for understanding the context and the overall picture.

[0030] Embodiments of the Invention

Embodiments of the present invention are described below with reference to the drawings.

[0031] FIG. 1 shows the configuration of an apparatus for realizing the present invention, which comprises a document group storage unit 11, a sentence analysis unit 12, a feature table creation unit 13, a paragraph classification unit 14, a classification result storage unit 15, a display control unit 16, a display unit 17, and so on.

[0032] The document group storage unit 11 stores, as a database, the many documents included in a document group.

[0033] Suppose that, as shown in FIG. 2, one of the many document groups stored in the document group storage unit 11 is a "group of papers on artificial intelligence". The papers belonging to it include, for example, "papers on expert systems", "papers on natural language processing", and "papers on neural networks"; further, the "papers on expert systems" include, for example, "papers on factory control" and "papers on markets". In this way, a given document is assumed to contain a large amount of related document content.

[0034] The sentence analysis unit 12 detects the paragraphs of each document stored in the document group storage unit 11 and performs morphological analysis.

[0035] Several methods of detecting paragraphs are conceivable: for example, detecting whether a blank line exists in the document and, if so, treating the blank line as a paragraph boundary; comparing the beginnings of adjacent lines and treating a line that begins with a one-character space as a paragraph boundary; or, for a description language such as the HTML mentioned above, detecting paragraph boundaries by detecting paragraph marks.
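As an illustration only, the three detection methods listed in paragraph [0035] might be sketched in Python as follows. The function names are assumptions, the indentation rule accepts either an ASCII or a full-width space, and the HTML version uses a simple regular expression on `<p>` tags rather than a full parser; the specification does not prescribe any of these details.

```python
import re

def split_on_blank_lines(text: str) -> list[str]:
    """Treat each run of blank lines as a paragraph boundary."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [part.strip() for part in parts if part.strip()]

def split_on_indent(text: str) -> list[str]:
    """Start a new paragraph at each line that begins with a space
    (ASCII or full-width); other lines continue the current paragraph."""
    paragraphs: list[list[str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore empty lines in this method
        if line[0] in (" ", "\u3000") or not paragraphs:
            paragraphs.append([line.strip()])
        else:
            paragraphs[-1].append(line.strip())
    return [" ".join(lines) for lines in paragraphs]

def split_on_p_tags(html: str) -> list[str]:
    """Treat HTML <p> and </p> tags as paragraph marks (regex sketch only)."""
    parts = re.split(r"(?i)</?p\b[^>]*>", html)
    return [part.strip() for part in parts if part.strip()]
```

Each function returns the list of paragraph texts, which would then be passed on for morphological analysis.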

[0036] FIG. 3 shows an example of the paragraph structure of one of the documents shown in FIG. 2. To simplify the explanation, the document is assumed here to have six paragraphs. These six units are referred to below as paragraphs A1 to A6.

[0037] The main morphemes are assumed to be as follows: paragraph A1 contains "expert system", "natural language", and "neural network"; paragraph A2 contains "expert system"; paragraph A3 contains "expert system" and "natural language"; paragraph A4 contains "natural language"; paragraph A5 contains "neural network"; and paragraph A6 contains "natural language" and "neural network".

[0038] The feature table creation unit 13 comprises a feature element extraction unit 131, a feature extraction unit 132, and a feature table 133. Based on the information extracted by the morphological analysis in the sentence analysis unit 12, the feature element extraction unit 131 extracts the information that represents the content of each paragraph (referred to as feature elements).

[0039] Based on the feature elements obtained by the feature element extraction unit 131, the feature extraction unit 132 counts, for a document such as that shown in FIG. 2, how many times each feature element appears in each paragraph.

[0040] A feature table 133 such as that shown in FIG. 4 is then created from the feature elements extracted by the feature element extraction unit 131 and the counts obtained by the feature extraction unit 132.

[0041] In the example feature table 133 shown in FIG. 4, the feature elements are "expert system", "natural language", and "neural network", the main morphemes that appeared in paragraphs A1 to A6 above.

[0042] The table shows that the feature element "expert system" appears twice in paragraph A1, once each in paragraphs A2 and A3, and zero times in paragraphs A4, A5, and A6. The feature element "natural language" appears once each in paragraphs A1, A3, A4, and A6, and zero times in paragraphs A2 and A5. The feature element "neural network" appears once each in paragraphs A1, A5, and A6, and zero times in paragraphs A2, A3, and A4.

[0043] The feature table 133 thus shows which feature element appears in which paragraph and how often.
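The feature table of FIG. 4, as described in paragraphs [0041] to [0043], can be reproduced with a short Python sketch. The morpheme lists below are a hypothetical stand-in for the output of the morphological analysis, written so that the counts match those stated in paragraph [0042]; the function and variable names are assumptions.

```python
from collections import Counter

# Feature elements named in the example of FIG. 4
FEATURE_ELEMENTS = ["expert system", "natural language", "neural network"]

# Hypothetical stand-in for morphological-analysis output: the main
# morphemes of paragraphs A1 to A6, listed once per occurrence.
PARAGRAPH_MORPHEMES = {
    "A1": ["expert system", "expert system", "natural language", "neural network"],
    "A2": ["expert system"],
    "A3": ["expert system", "natural language"],
    "A4": ["natural language"],
    "A5": ["neural network"],
    "A6": ["natural language", "neural network"],
}

def build_feature_table(paragraph_morphemes, feature_elements):
    """Return {feature element: {paragraph id: occurrence count}}."""
    table = {feature: {} for feature in feature_elements}
    for paragraph_id, morphemes in paragraph_morphemes.items():
        counts = Counter(morphemes)
        for feature in feature_elements:
            # A Counter returns 0 for morphemes that never occurred
            table[feature][paragraph_id] = counts[feature]
    return table

feature_table = build_feature_table(PARAGRAPH_MORPHEMES, FEATURE_ELEMENTS)
```

A nested dictionary keeps zero counts explicit, mirroring the way the description lists paragraphs in which a feature element does not appear.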

[0044] The paragraph classification unit 14 refers to this feature table 133 and classifies paragraphs A1 to A6 into a plurality of contents based on, for example, the frequency with which the feature elements appear in each paragraph.

[0045] For example, content 1 groups together the paragraphs that contain the feature element "expert system". Those paragraphs are A1, A2, and A3, so the number of paragraphs is 3.

[0046] Content 2 groups together the paragraphs that contain the feature element "natural language". Those paragraphs are A1, A3, A4, and A6, so the number of paragraphs is 4.

[0047] Content 3 groups together the paragraphs that contain the feature element "neural network". Those paragraphs are A1, A5, and A6, so the number of paragraphs is 3.

[0048] Besides classifying the feature elements and the paragraphs that have them, the paragraph classification unit 14 also associates each content with its paragraphs: the paragraphs belonging to content 1 are A1, A2, and A3; those belonging to content 2 are A1, A3, A4, and A6; and those belonging to content 3 are A1, A5, and A6.

[0049] The classification results described above are stored in the classification result storage unit 15.

[0050] The output control unit 16 reads the contents of the classification result storage unit 15 and outputs them as the classification result. In doing so, it performs control to display the classification result on the display unit 17, and, when the user issues an instruction to display search results, it can also compose the classified data from the contents of the classification result storage unit 15 and the document group storage unit 11 and output that data.

[0051] FIG. 5 shows an example of the classification result displayed on the display unit 17. Because this example has only three contents (content 1, content 2, and content 3), only contents 1 to 3 appear on the screen; in practice there are usually many more. In that case, for legibility, about ten contents, for example, are displayed on one screen. The number of contents displayed per screen can be set to any suitable value, and when there are many contents they can be divided into groups of about ten and displayed by switching pages.
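The page switching mentioned in paragraph [0051] amounts to splitting the list of contents into fixed-size pages. A minimal sketch, assuming a hypothetical `paginate` helper and the page size of ten that the description gives as an example:

```python
def paginate(items, page_size=10):
    """Split a list into consecutive pages of at most page_size items."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    return [items[i:i + page_size] for i in range(0, len(items), page_size)]

# 23 hypothetical content labels give pages of 10, 10, and 3 entries
labels = [f"content {n}" for n in range(1, 24)]
pages = paginate(labels)
```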

[0052] In the display example of FIG. 5, content 1 has the feature element "expert system" and a paragraph count of 3; content 2 has the feature element "natural language" and a paragraph count of 4; and content 3 has the feature element "neural network" and a paragraph count of 3.

[0053] In this way, the feature element representing each content and the number of paragraphs containing that feature element are displayed for each content. The display unit 17 also shows a user instruction unit 21, such as "Display results", for entering user instructions.

[0054] The display shown in FIG. 5 consists of a number identifying the content, its feature element, and the paragraph count, for example "content 1", "expert system", "3". The display is not limited to this form; the relevant paragraph numbers, for example, can be added. As one example, along with "content 1", "expert system", and "3", information identifying the paragraphs (such as paragraph numbers) may be displayed, as in "paragraphs A1, A2, A3".
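The two display variants described in paragraphs [0052] to [0054] could be produced as in the sketch below. The row layout, the `format_rows` name, and the `show_paragraph_ids` flag are assumptions; FIG. 5 itself is not reproduced here.

```python
# Classification result from the example (contents 1 to 3)
CONTENTS = [
    {"content": 1, "feature": "expert system", "paragraphs": ["A1", "A2", "A3"]},
    {"content": 2, "feature": "natural language", "paragraphs": ["A1", "A3", "A4", "A6"]},
    {"content": 3, "feature": "neural network", "paragraphs": ["A1", "A5", "A6"]},
]

def format_rows(contents, show_paragraph_ids=False):
    """Build one display row per content: number, feature element, paragraph count."""
    rows = []
    for entry in contents:
        row = f"content {entry['content']}  {entry['feature']}  {len(entry['paragraphs'])}"
        if show_paragraph_ids:
            # Optional variant: also list the paragraphs that belong to the content
            row += "  paragraphs " + ", ".join(entry["paragraphs"])
        rows.append(row)
    return rows

for row in format_rows(CONTENTS, show_paragraph_ids=True):
    print(row)
```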

[0055] Looking at this display, a user who judges that the desired information may relate to, say, content 1 (whose feature element is "expert system") clicks row R1 of content 1 with a mouse or the like and then clicks the "Display results" user instruction unit 21.

[0056] The output control unit 16 then reads the paragraphs belonging to the selected content 1 from the document group storage unit 11 and outputs them. In this example, the number of paragraphs belonging to the selected content 1 is “3”, and those paragraphs are A1, A2, and A3. Accordingly, on receiving the result display request from the user, the output control unit 16 reads the contents of paragraphs A1, A2, and A3, which belong to the content selected by the user, from the document group storage unit 11 and displays them as shown in FIG. 6. FIG. 6 shows an example in which the document contents of paragraphs A1, A2, and A3 belonging to the selected content 1 are displayed. In this example the document contents of all three paragraphs are displayed on one screen; if they do not fit on one screen, they can be viewed by paging.
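The read-and-display step with paging can be sketched as below. This is a sketch under stated assumptions: the dictionary `store` stands in for the document group storage unit 11, the paragraph texts are placeholders, and the function name `build_pages` and the line-based page size are hypothetical choices for illustration.

```python
from typing import Dict, List


def build_pages(paragraph_ids: List[str],
                store: Dict[str, str],
                lines_per_page: int) -> List[List[str]]:
    """Read the selected paragraphs from the store and split them into screen pages."""
    lines: List[str] = []
    for pid in paragraph_ids:
        lines.append(f"[{pid}]")               # label line for each paragraph
        lines.extend(store[pid].splitlines())  # the paragraph text itself
    # Cut the flat list of lines into pages of at most lines_per_page lines.
    return [lines[i:i + lines_per_page] for i in range(0, len(lines), lines_per_page)]


# Placeholder paragraph texts (not taken from the specification).
store = {
    "A1": "first paragraph text",
    "A2": "second paragraph text\ncontinued",
    "A3": "third paragraph text",
}
pages = build_pages(["A1", "A2", "A3"], store, lines_per_page=4)
for number, page in enumerate(pages, start=1):
    print(f"--- page {number} ---")
    print("\n".join(page))
```

With a page size large enough to hold every line, the function returns a single page, which corresponds to the case where all three paragraphs fit on one screen.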

[0057] FIG. 7 is a flowchart showing the processing procedure of the embodiment described above. In FIG. 7, paragraphs are first detected from the document, morphological analysis is performed, characteristic elements are extracted based on the morphological analysis results, and a feature table 133 such as that shown in FIG. 4 is created from the extracted characteristic elements (step s1). This processing is performed by the sentence analysis unit 12 and the feature table creation unit 13. Next, the paragraph classifying unit 14 refers to the feature table 133 and classifies the paragraphs into contents (step s3), stores the classification result in the classification result storage unit 15, and displays it on the display unit 17 (step s4). An example of the classification result displayed at this point is shown in FIG. 5.
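The creation of a feature table from detected paragraphs can be sketched as follows. Several assumptions are made here and are not from the specification: paragraphs are taken to be separated by blank lines, morphological analysis is replaced by simple substring matching against a hypothetical `vocabulary` list, and the sample document text is invented for the example.

```python
from collections import defaultdict
from typing import Dict, List


def split_paragraphs(document: str) -> List[str]:
    """Detect paragraphs, assuming they are separated by blank lines."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def extract_features(paragraph: str, vocabulary: List[str]) -> List[str]:
    """Stand-in for morphological analysis: keep vocabulary terms found in the text."""
    text = paragraph.lower()
    return [term for term in vocabulary if term in text]


def build_feature_table(paragraphs: List[str],
                        vocabulary: List[str]) -> Dict[str, List[str]]:
    """Map each characteristic element to the labels of the paragraphs containing it."""
    table: Dict[str, List[str]] = defaultdict(list)
    for index, paragraph in enumerate(paragraphs, start=1):
        label = f"A{index}"  # paragraph labels A1, A2, ... as in the embodiment
        for feature in extract_features(paragraph, vocabulary):
            table[feature].append(label)
    return dict(table)


# Invented sample text and vocabulary, for illustration only.
document = (
    "An expert system encodes rules.\n\n"
    "The expert system asks questions.\n\n"
    "A neural network learns weights."
)
vocabulary = ["expert system", "neural network"]
table = build_feature_table(split_paragraphs(document), vocabulary)
print(table)
```

A real implementation would derive the characteristic elements from the morphological analysis results rather than from a fixed vocabulary.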

[0058] When the user views the display, selects the desired content, and enters “display result” from the user instruction unit 21, the output control unit 16 accepts the user instruction (step s4), constructs output data from the contents of the document group storage unit 11 and the classification result storage unit 15 (step s5), and outputs the data (step s6). An example of this output data is shown in FIG. 6.

[0059] As described above, in this embodiment, paragraphs are first extracted from a document, characteristic elements representing the features of the contents are extracted for each paragraph, and, based on those characteristic elements, paragraphs having the same characteristic element are regarded as one unit and classified as a content.
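The grouping of paragraphs that share a characteristic element can be sketched as below. This is an assumption-laden sketch: the function name `classify`, the dictionary layout, the numbering of contents in order of first appearance, and the “neural network” entries are hypothetical. How a paragraph with several characteristic elements is handled is not stated in this section; the sketch simply places it in every matching content.

```python
from typing import Dict, List


def classify(paragraph_features: Dict[str, List[str]]) -> List[dict]:
    """Group paragraphs sharing a characteristic element into one content each."""
    groups: Dict[str, List[str]] = {}
    for label, features in paragraph_features.items():
        for feature in features:
            # Paragraphs with the same characteristic element land in the same group.
            groups.setdefault(feature, []).append(label)
    # Number the contents in the order their characteristic element first appeared.
    return [
        {"number": number, "feature": feature, "paragraphs": labels}
        for number, (feature, labels) in enumerate(groups.items(), start=1)
    ]


# A1 to A3 follow the "expert system" example; A4 and A5 are invented.
paragraph_features = {
    "A1": ["expert system"],
    "A2": ["expert system"],
    "A3": ["expert system"],
    "A4": ["neural network"],
    "A5": ["neural network"],
}
contents = classify(paragraph_features)
for content in contents:
    print(content)
```

Each resulting entry carries the content number, its characteristic element, and the paragraph labels, which is the information needed for the classified display.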

[0060] Because the document can be divided regardless of its content or type, any kind of document can be handled. Moreover, the document contents of the divided paragraphs are not used as they are, one paragraph at a time. Instead, based on the characteristic elements of each paragraph, paragraphs that share a characteristic element are treated as a semantic unit, and that unit is extracted as one content.

[0061] Therefore, even though the unit into which the document analysis unit 2 divides a document is the paragraph, the output is produced not paragraph by paragraph but content by content, each content consisting of the document contents of multiple paragraphs that form a semantic unit. A single output unit is thus never subdivided too finely.

[0062] As a result, as mentioned earlier, when viewing the difference between an updated part and the pre-update information in online information, for instance, output in contents composed of multiple paragraphs that form a semantic unit makes it much easier to understand the context and the overall picture of the material.

[0063] The embodiment described above is one example of a preferred embodiment of the present invention, but the present invention is not limited to it, and various modifications are possible without departing from the gist of the invention. The processing program that performs the processing of the present invention can be stored on a storage medium such as a floppy disk, an optical disk, or a hard disk, and the present invention includes such storage media. A form in which the data is obtained from a network may also be used.

【００６４】[0064]

[Effects of the Invention] As described above, the present invention extracts paragraphs from a document, extracts characteristic elements for each paragraph, and classifies the document into contents by semantic unit for output. In addition, the document is classified into multiple contents and displayed, and when a content selection instruction is received from the user, the document contents belonging to the selected content are output. The document therefore only needs to be divided at paragraphs, regardless of its content or type, so any document can be handled. And because the paragraphs are not used as they are but output is produced in content units composed of multiple paragraphs that form a semantic unit, a single output unit is never divided too finely.

[0065] Furthermore, the document is classified into contents by semantic unit by treating paragraphs that share a characteristic element as one content, based on the characteristic elements of each paragraph. Consequently, even when a paragraph division method suited to various types of documents is applied and divides the paragraphs too finely, output is still produced in content units composed of multiple paragraphs that form a semantic unit, so the target document is never extracted in fragments too small to make sense.

[0066] Moreover, when the document is classified into contents by semantic unit and displayed, the display for each content consists of at least the characteristic element representing that content, or of that characteristic element together with data on the paragraphs included in the content. This makes each content easy to grasp on screen and allows the desired content to be selected efficiently.

[0067] When a content selection instruction is received from the user, the contents of the paragraphs belonging to the selected content are output, so the output is not the document content of a single paragraph but that of multiple paragraphs forming a semantic unit. Thus, when some processing is performed content by content (for example, viewing the difference between an updated part and the pre-update information in online information), output in content units composed of multiple paragraphs that form a semantic unit makes it much easier to understand the context and the overall picture of the material.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a document information extraction device according to an embodiment of the present invention.

FIG. 2 is a diagram showing an example of a document group used in the embodiment of the present invention.

FIG. 3 is a diagram showing an example of the paragraph configuration of a document used in the embodiment of the present invention.

FIG. 4 is a diagram showing an example of a feature table according to the embodiment of the present invention.

FIG. 5 is a diagram showing a display example of a plurality of contents according to the embodiment of the present invention.

FIG. 6 is a diagram showing an output example of the paragraph contents belonging to a content selected from the plurality of contents shown in FIG. 4.

FIG. 7 is a flowchart illustrating the processing procedure according to the embodiment of the present invention.

[Explanation of symbols]

11 Document group storage unit
12 Sentence analysis unit
13 Feature table creation unit
14 Paragraph classifying unit
15 Classification result storage unit
16 Output control unit
17 Display unit
21 “Display result” user instruction unit
131 Feature element extraction unit
132 Feature extraction unit
133 Feature table
A1 to A6 Paragraphs

Claims

[Claims]

1. A document information extracting method, wherein paragraphs are detected from the document content of a document, a characteristic element representing a feature of the content is extracted for each paragraph, a feature table representing the relationship between each characteristic element and the paragraphs containing it is created, and, based on the feature table, the document is classified into a plurality of contents by semantic unit and output.

2. The document information extracting method according to claim 1, wherein the process of classifying the document into a plurality of contents by semantic unit and outputting it classifies the document into a plurality of contents by semantic unit based on the feature table, displays them, and, on receiving a content selection instruction from a user, outputs the contents of the paragraphs belonging to the selected content.

3. The document information extracting method according to claim 1, wherein the process of classifying the document into a plurality of contents by semantic unit based on the feature table groups, based on the characteristic elements of each paragraph, a plurality of paragraphs having a common characteristic element into one set and treats that set as a content.

4. The document information extracting method as described, wherein the display content shown when the document is classified into contents by semantic unit and displayed includes, for each content, at least a characteristic element representing that content.

5. The document information extracting method according to claim 2, wherein the display content shown when the document is classified into contents by semantic unit and displayed includes, for each content, at least a characteristic element representing that content and data on the paragraphs included in the content.

6. A document information extracting device comprising: a document group storage unit that stores a document group; a sentence analysis unit that detects paragraphs from the document content of a document stored in the document group storage unit and performs morphological analysis for each paragraph; a feature table creation unit that extracts, from the analysis result of the sentence analysis unit, a characteristic element representing a feature of the content for each paragraph and creates a feature table representing the relationship between each characteristic element and the paragraphs containing it; a paragraph classifying unit that classifies the document into a plurality of contents by semantic unit based on the contents of the feature table; a classification result storage unit that stores the contents classified by the paragraph classifying unit; and an output control unit that reads and outputs the contents of the classification result storage unit.

7. The document information extracting device according to claim 6, wherein the output control unit controls the display of the plurality of contents and, on receiving a content selection instruction from a user, outputs the contents of the paragraphs belonging to the selected content.

8. The document information extracting device according to claim 6, wherein the process, performed by the paragraph classifying unit, of classifying the document into contents by semantic unit based on the feature table groups, based on the characteristic elements of each paragraph, a plurality of paragraphs having a common characteristic element into one set and treats that set as a content.

9. The document information extracting device as described, wherein the display content shown when the document is classified into contents by semantic unit and displayed includes, for each content, at least a characteristic element representing that content.

10. The document information extracting device according to claim 7, wherein the display content shown when the document is classified into contents by semantic unit and displayed includes, for each content, at least a characteristic element representing that content and data on the paragraphs included in the content.

11. A storage medium storing a document information extraction processing program for causing a computer to perform document extraction processing, wherein the processing program detects paragraphs from the document content of a document, extracts a characteristic element representing a feature of the content for each paragraph, creates a feature table representing the relationship between each characteristic element and the paragraphs containing it, and, based on the feature table, classifies the document into a plurality of contents by semantic unit and outputs it.

12. The storage medium storing the document information extraction processing program described in Item No. 0, wherein the process of classifying the document into a plurality of contents by semantic unit and outputting it classifies the document into a plurality of contents by semantic unit based on the feature table, displays them, and, on receiving a content selection instruction from a user, outputs the contents of the paragraphs belonging to the selected content.

13. The storage medium storing the document information extraction processing program according to claim 11 or 12, wherein the process of classifying the document into a plurality of contents by semantic unit based on the feature table groups, based on the characteristic elements of each paragraph, a plurality of paragraphs having a common characteristic element into one set and treats that set as a content.

14. The storage medium storing the described document information extraction processing program, wherein the display content shown when the document is classified into contents by semantic unit and displayed includes, for each content, at least a characteristic element representing that content.

15. The storage medium storing the document information extraction processing program according to claim 12, wherein the display content shown when the document is classified into contents by semantic unit and displayed includes, for each content, at least a characteristic element representing that content and data on the paragraphs included in the content.