JP2001134600A

JP2001134600A - System and method for information extraction and recording medium stored with recorded program for information extraction

Info

Publication number: JP2001134600A
Application number: JP31706999A
Authority: JP
Inventors: Hiroshi Yamada; 洋志山田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1999-11-08
Filing date: 1999-11-08
Publication date: 2001-05-18

Abstract

PROBLEM TO BE SOLVED: To accurately extract information matching a purpose from a set of documents having different contents and formats. SOLUTION: A document classifying means 1 classifies the document set into specified categories. An information extracting means 3 judges which information is extracted according to the classifications of the documents and extracts information from the documents. An information classifying means 5 classifies the extracted information. A result selecting means 7 selects only necessary information from the extracted and classified information. A result output means 186 divides and outputs the information according to the classification result of the information.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to an information extraction system, an information extraction method, and a recording medium storing an information extraction program (hereinafter collectively referred to as the information extraction system) for extracting specific information from documents. In particular, it relates to an information extraction system that classifies documents and can change the information to be extracted according to the classification result.

[0002]

[Prior Art] An example of a conventional information extraction system of this kind is described in Japanese Patent Application Laid-Open No. 8-329165. FIG. 23 shows a flowchart of the operation of this conventional system, and the operation of the conventional technique is described below with reference to FIG. 23.

[0003] When a document to be processed is input (4801 to 4803), character strings having a specific pattern are extracted from the document as numerical data (4804), and character strings that follow a certain rule and appear before and after the numerical data are extracted as numeric character string data (4805). Noun data is extracted from the numeric character string data, classified into predetermined items, and associated with the numerical data (4806). The numerical data obtained for each item in this way is totaled for each predetermined period, such as one month (4807), and the totals are displayed (4808).

[0004] With this conventional system, the amounts spent on shopping can be extracted from documents in a specific format, such as a household account book.

[0005] As a conventional system that extracts information from documents regardless of format, there is the system described in the paper "Development of a Proper Noun Extraction System and Its Evaluation in IREX-NE" (Proceedings of the IREX-NE Workshop, pp. 171-178, 1999). This system can extract place names, personal names, amounts of money, and the like from documents.

[0006] In addition, among the information services offered on the World Wide Web, it has become common in recent years to collect job information or present (giveaway) information, manually pick out items such as company names and prizes, and present them together on a Web page.

[0007]

[Problems to Be Solved by the Invention] The first problem of the conventional techniques described above is that information cannot be extracted with sufficient accuracy from a set of documents whose content and format vary. This is because conventional information extraction systems assume documents in a specific format.

[0008] The second problem is that unnecessary information is also extracted. What information is needed depends on the purpose for which each document was created and is used, so even the same kind of information, such as an amount of money or a date, may or may not be needed depending on the type of document in which it appears and on its context within the document. Consequently, extracting information from every document, or from the whole of a document, brings in information that is not needed.

[0009] The third problem is that if manual work is introduced to compensate for the first and second problems, a large volume of documents cannot be handled, or a great deal of time and cost is required.

[0010] In view of the above problems of the prior art, an object of the present invention is therefore to provide an information extraction system that can extract information suited to the user's purpose from a set of documents with no fixed format.

[0011] Another object of the present invention is to provide an information extraction system that makes it easy to select the documents that are needed.

[0012]

[Means for Solving the Problems] The information extraction system of the present invention comprises document classification means for classifying documents, and information extraction means that can change both the information to be extracted and the extraction method according to the classification of a document. It operates so as to extract information that matches the classification result of each document.

[0013] It further comprises information classification means that can classify the extracted information according to the classification of the document.

[0014] By adopting this configuration and accurately extracting only the necessary information from each document, the objects of the present invention can be achieved.

[0015]

[Embodiments of the Invention] Next, embodiments of the present invention are described in detail with reference to the drawings.

[0016] (First Embodiment) [Description of Configuration] Referring to FIG. 1, the first embodiment of the present invention comprises document classification means 1, information extraction control means 2, a plurality of information extraction means 3, information classification control means 4, a plurality of information classification means 5, and result output means 6.

[0017] A plurality of information extraction means 3 exist, one for each type of document classification, and each includes extraction execution means 31, extraction information definition means 32, and extraction knowledge storage means 33.

[0018] A plurality of information classification means 5 exist, one for each type of document classification, and each includes classification execution means 51 and classification knowledge storage means 52.

[0019] In outline, these means operate as follows.

[0020] The document classification means 1 takes a plurality of documents as input and classifies them into specified categories.

[0021] The information extraction control means 2 decides which information extraction means 3 to use according to the document classification result produced by the document classification means 1, and has the chosen information extraction means 3 extract information from the document. The information extraction means 3 activated by the information extraction control means 2 extracts information from the document specified by the information extraction control means 2.

[0022] The information classification control means 4 decides which information classification means 5 to use according to the document classification result, and has the chosen information classification means 5 classify the information extracted from the document. The information classification means 5 activated by the information classification control means 4 classifies the information specified by the information classification control means 4.

[0023] The result output means 6 outputs the document classification results, the extracted information, and the information classification results.

[0024] The information extraction control means 2 and the information extraction means 3 are now described in detail with reference to FIG. 2.

[0025] A plurality of information extraction means 3 are provided, matching the number of document classification categories. FIG. 2 shows three information extraction means 3a, 3b, and 3c, but this does not limit the number of information extraction means to three.

[0026] The information extraction control means 2 selects an information extraction means 3 according to the classification result of the document classification means 1.

[0027] Each information extraction means 3 comprises extraction execution means 31, extraction information definition means 32, and extraction knowledge storage means 33.

[0028] The extraction execution means 31 refers to the extraction information definition means 32 and the extraction knowledge storage means 33 and extracts from the document the information that corresponds to the document's category. The extraction information definition means 32 stores the types of information that correspond to the document classification category. The extraction knowledge storage means 33 stores the methods for identifying each item of information stored in the extraction information definition means 32.
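
For illustration only, the relationship among the information extraction control means 2 and the three parts of an information extraction means 3 can be sketched in Python as follows. The class names, field names, and the sample finder are assumptions made for this sketch and do not appear in the specification.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class InformationExtractor:
    """Stands in for one information extraction means 3 (one per category)."""
    # Role of extraction information definition means 32: what to extract.
    item_names: List[str]
    # Role of extraction knowledge storage means 33: how to find each item.
    finders: Dict[str, Callable[[str], List[str]]] = field(default_factory=dict)

    def extract(self, text: str) -> Dict[str, List[str]]:
        # Role of extraction execution means 31: consult 32 and 33, then extract.
        result = {}
        for name in self.item_names:
            finder = self.finders.get(name)
            found = finder(text) if finder else []
            if found:
                result[name] = found
        return result


class ExtractionController:
    """Stands in for information extraction control means 2."""

    def __init__(self, extractors: Dict[str, InformationExtractor]):
        self.extractors = extractors

    def run(self, text: str, categories: List[str]) -> Dict[str, Dict[str, List[str]]]:
        # Only categories that have an extractor registered are processed.
        return {c: self.extractors[c].extract(text)
                for c in categories if c in self.extractors}


# Hypothetical extractor for the "job information" category.
job = InformationExtractor(
    item_names=["work location", "job type"],
    finders={"work location": lambda t: re.findall(r"Work location: (\w+)", t)},
)
controller = ExtractionController({"job information": job})
out = controller.run("Work location: Kawasaki", ["job information", "event"])
```

In this sketch an item with no registered finder, or a category with no registered extractor, is simply skipped rather than treated as an error.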

[0029] The information classification control means 4 and the information classification means 5 are described with reference to FIG. 3.

[0030] A plurality of information classification means 5 are provided, matching the number of document classification categories. FIG. 3 shows three information classification means 5a, 5b, and 5c, but this does not limit the number of information classification means to three.

[0031] Each information classification means 5 comprises classification execution means 51 and classification knowledge storage means 52.

[0032] The classification execution means 51 refers to the classification knowledge storage means 52 and classifies the information extracted by the information extraction means 3. The classification knowledge storage means 52 stores, for each type of extracted information, the rules for classifying that information.

[0033] [Description of Operation] Next, the overall operation of this embodiment is described in detail with reference to FIG. 1 and the flowchart of FIG. 4.

[0034] First, the document classification means 1 classifies the documents into categories (step 401 in FIG. 4). Usable classification methods include the type classification technique that exploits the characteristics of Web pages, filed by the present applicant before the filing of the present invention (Japanese Patent Application No. 10-200171), and classification techniques that use word information; however, these do not limit the configuration of the document classification means 1. Various existing techniques can be used for the document classification means 1, and these will be well understood by those skilled in the art. A single document may be classified into several categories, or into none.

[0035] Next, the information extraction control means 2 selects the information extraction means 3 that corresponds to the document classification result of step 401 and has it process the document (step 402).

[0036] The information extraction means 3 extracts information from the document (step 403). Each information extraction means 3 extracts from the document the information that suits its document category. The extraction execution means 31 refers to the extraction information definition means 32 to obtain the information to be extracted from the document. It then refers to the extraction knowledge storage means 33 to obtain the method for identifying each item of information to be extracted. Identification methods include: the format of the information itself; item names and positions in tables, bulleted lists, and the like; notation such as words and symbols that co-occur before and after the information; and rules for inferring information that is not stated directly. Several methods can also be combined.

[0037] Next, the information classification control means 4 selects the information classification means 5 that corresponds to the document classification result and has it classify the information extracted from the document (step 404).

[0038] The information classification means 5 classifies the extracted information according to the document category it handles (step 405). The classification execution means 51 refers to the classification knowledge storage means 52 and classifies the information for each type of extracted information. Forms of classification knowledge include: dividing numerical or time information into several ranges; preparing a table that maps words to classes; using a dictionary with a hierarchical structure; and classifying by pattern matching on character strings. Several methods can also be combined.
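
Two of the forms of classification knowledge listed above, range division and a word-to-class table, can be sketched as follows. This is a minimal illustration; the boundaries, labels, and table entries are hypothetical values chosen for the sketch, not data from the specification.

```python
from bisect import bisect_right

# Hypothetical range boundaries and labels for a numeric item.
AMOUNT_BOUNDS = [200_000, 300_000]
AMOUNT_LABELS = ["under 200,000", "200,000 to under 300,000", "300,000 and over"]

# Hypothetical word-to-class correspondence table.
WORD_TABLE = {"engineer": "technical", "programmer": "technical", "sales": "business"}


def classify_by_range(value, bounds, labels):
    """Return the label of the range that contains value.

    bounds must be sorted; labels has one more entry than bounds.
    A value equal to a boundary falls into the upper range.
    """
    return labels[bisect_right(bounds, value)]


def classify_by_table(word, table, default="other"):
    """Look the word up in the correspondence table, ignoring case."""
    return table.get(word.lower(), default)
```

Keeping the boundaries and the table as data, separate from the functions, mirrors the split between the classification knowledge storage means 52 and the classification execution means 51.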

[0039] Finally, the results are output (step 406). The document name and the classification results of the extracted information are output. The document category and the extracted information itself may also be output.
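
The flow of steps 401 to 406 can be summarized in a short Python sketch. It is an assumption-laden outline rather than the claimed implementation: the function name `run_pipeline`, the callable-based interfaces, and the tuple layout of the output rows are all invented for illustration.

```python
def run_pipeline(documents, classify_document, extractors, info_classifiers):
    """Sketch of steps 401-406.

    documents:         {document name: text}
    classify_document: text -> list of categories (may be empty or have several)
    extractors:        {category: text -> {item name: value}}
    info_classifiers:  {category: (item name, value) -> class label}
    """
    rows = []
    for name, text in documents.items():
        for category in classify_document(text):              # step 401
            extractor = extractors.get(category)              # step 402
            if extractor is None:
                continue
            extracted = extractor(text)                       # step 403
            info_classifier = info_classifiers.get(category)  # step 404
            for item, value in extracted.items():
                label = info_classifier(item, value) if info_classifier else None  # step 405
                rows.append((name, category, item, value, label))
    return rows                                               # handed to output, step 406


# Toy stand-ins for the individual means (hypothetical).
rows = run_pipeline(
    {"y.html": "Work location: Kawasaki"},
    classify_document=lambda text: ["job information"],
    extractors={"job information": lambda text: {"work location": text.split(": ")[1]}},
    info_classifiers={"job information": lambda item, value: "Kanagawa" if value == "Kawasaki" else None},
)
```

Because the document classifier returns a list, a document that falls into several categories is processed once per category, and a document that falls into none produces no rows.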

[0040] Next, the effects of this embodiment are described.

[0041] Because this embodiment classifies the documents first and then extracts the necessary information according to the classification result, only the necessary information can be extracted even when documents of many different kinds are mixed together.

[0042] In addition, because this embodiment selects the method of extracting information according to the document classification result, highly accurate information extraction is possible.

[0043] Furthermore, by consulting documents on the basis of the extracted information, the user can easily find the documents that are needed.

[0044] [Example] Next, the operation of this embodiment is described using a concrete example.

[0045] This example describes the case of extracting information from Web pages (HTML files).

[0046] The Web pages to be classified are saved on a storage device in advance by an automatic collection program or a download program. Alternatively, a list of Web page URLs may be prepared and the pages downloaded as needed.

[0047] First, the document classification means 1 classifies the Web pages.

[0048] In this example, the structured document search system described in Japanese Patent Application No. 10-200171 is used as the document classification means 1. As noted earlier, however, the document classification means 1 of the present invention is not limited to this example. This system defines several types for documents and calculates the degree of fit between each document and each type. By setting a reference value for the degree of fit, a document can be classified into the types for which its degree of fit is at or above the reference value.

[0049] FIG. 5 shows an example of degrees of fit calculated for Web pages. In FIG. 5, "x.html", "y.html", and "z.html" are page names. URLs also carry domain and directory names, but these are omitted in the figure. "Job information", "event", and "present" are the type names that serve as classification results, and the numbers are degrees of fit: the higher the number, the better the content of the document fits the type. If the reference value for the degree of fit is set to 70 in this example, "x.html" is classified as "event", "y.html" as "job information", and "z.html" as "present".
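
The threshold rule described for FIG. 5 can be sketched as below. The individual scores are hypothetical, since the values in FIG. 5 are not reproduced in the text; only the reference value of 70 and the resulting classifications come from the description.

```python
def classify_by_fit(scores, threshold=70):
    """Return every type whose degree of fit is at or above the threshold.

    The result may contain several types, or none.
    """
    return [type_name for type_name, score in scores.items() if score >= threshold]


# Hypothetical degrees of fit, chosen to reproduce the classifications in the text.
pages = {
    "x.html": {"job information": 10, "event": 85, "present": 30},
    "y.html": {"job information": 90, "event": 20, "present": 5},
    "z.html": {"job information": 15, "event": 40, "present": 75},
}
result = {page: classify_by_fit(scores) for page, scores in pages.items()}
```

Returning a list rather than a single label keeps the property noted for step 401 that a document may belong to several categories or to none.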

[0050] Next, the information extraction control means 2 selects the information extraction means 3 that corresponds to the classification result of each page. Information is extracted from pages classified as "job information", "event", and "present" by the information extraction means 3a, 3b, and 3c, respectively.

[0051] FIG. 6 shows examples of the types of information stored by the extraction information definition means 32 of each information extraction means 3. The extraction information definition means 32a, 32b, and 32c of the information extraction means 3a, 3b, and 3c store the information shown at 6a, 6b, and 6c in FIG. 6, respectively. For example, when the information extraction means 3a is responsible for extracting information from documents classified as "job information", its extraction information definition means 32a holds the information shown at 6a in FIG. 6, and the extraction execution means 31a extracts from the document the information that follows this definition (in this example, "work location" and "job type"). In this example, the information extraction control means 2 also needs information for determining which classification each of the information extraction means handles.

[0052] FIG. 7 shows examples of the information extraction methods stored in the extraction knowledge storage means 33. Parts (a), (b), and (c) of FIG. 7 correspond to the extraction knowledge storage means 33a, 33b, and 33c, respectively.

[0053] FIG. 7 describes, for each "information name" of each type held by the extraction information definition means 32 shown in FIG. 6, the method used to find the information. In FIG. 7, the "description type" specifies where the information to be extracted is written. For "heading", the part of a table or bulleted list that carries a heading given under "pattern" is extracted. For "specified tag", the content of a specific tag is extracted, and for "text", the part of the text that matches the "pattern" is extracted.

[0054] In the patterns in FIG. 7, %d and %s denote variables and indicate that an arbitrary character string may appear there. Extraction can be made more accurate by using a notation that allows more detailed constraints, such as numeric ranges or the type and length of characters. When there are several extraction methods, they are ordered by "priority".
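
One way to realize such patterns is to translate them into regular expressions. The sketch below assumes, beyond what the description states, that %d is restricted to digits and %s to any non-empty text; the sample pattern is also hypothetical, since the patterns of FIG. 7 are not reproduced in the text.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a pattern containing %d / %s placeholders into a compiled regex.

    Assumption for this sketch: %d captures a run of digits and %s captures
    the shortest non-empty text that lets the rest of the pattern match.
    All other characters are matched literally.
    """
    pieces = []
    for part in re.split(r"(%d|%s)", pattern):
        if part == "%d":
            pieces.append(r"(\d+)")
        elif part == "%s":
            pieces.append(r"(.+?)")
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces))


# Hypothetical date pattern applied to a sentence containing the date from the example.
date_regex = pattern_to_regex("%d年%d月%d日")
match = date_regex.search("開催日は1999年10月10日です")
```

Because %s is matched lazily here, a pattern that ends in %s would capture only a single character; a stricter notation with character classes or lengths, as the paragraph above suggests, avoids that.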

[0055] In the example of FIG. 7, the "work location" of the "job information" type and the "venue" of the "event" type are both information about places, but because they are extracted with different patterns they can be told apart. Likewise, the "date" of the "event" type and the "application deadline" of the "present" type are both dates but can be extracted separately.

[0056] Accordingly, for "y.html", which was classified as "job information" in the example of FIG. 5, "work location" and "job type" are extracted in accordance with 6a in FIG. 6, using the extraction methods in FIG. 7(a).

[0057] FIG. 8 is an example of how a Web page classified as "job information" is displayed. In the actual file, tables and bulleted lists are expressed with tags. Extraction of information from this page is described below.

[0058] For "work location", in accordance with FIG. 7(a), the tables and bulleted lists are searched for an item with the heading "work location", "place of work", or "sales office", and "Kawasaki City" is extracted. Similarly, "system engineer" is extracted from the "job type" item.
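
The heading-based lookup can be sketched as follows, assuming the table or bulleted list has already been parsed into (heading, value) pairs. The function name and the parsed rows are hypothetical; the heading alternatives and the extracted values follow the example above.

```python
def extract_by_heading(rows, headings):
    """Return the value of the first row whose heading is one of the given alternatives.

    rows: (heading, value) pairs already parsed from a table or bulleted list.
    Headings are compared case-insensitively after trimming whitespace.
    """
    wanted = {h.lower() for h in headings}
    for heading, value in rows:
        if heading.strip().lower() in wanted:
            return value.strip()
    return None


# Hypothetical parsed content of a page like the one in FIG. 8.
rows = [("Job type", "System engineer"), ("Work location", "Kawasaki City")]
WORK_LOCATION_HEADINGS = ["work location", "place of work", "sales office"]
location = extract_by_heading(rows, WORK_LOCATION_HEADINGS)
```

Listing several alternative headings for one item is what lets the same rule cope with pages that label the field differently.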

[0059] For "x.html", which was classified as "event" in the example of FIG. 5, "name", "venue", and "date" are extracted in accordance with 6b in FIG. 6.

[0060] FIG. 9 is an example of how an "event" Web page is displayed. From this page, "name" and "venue" are extracted on the basis of the bulleted-list headings. Because the "date" is not given in the bulleted list, a character string matching the pattern in FIG. 7(b) is searched for in the text, and "October 10, 1999" is extracted.
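
The fallback from a heading lookup to a text pattern can be expressed with the "priority" ordering of FIG. 7. The sketch below assumes that a smaller priority number is tried first and uses invented page content; only the date 1999年10月10日 (October 10, 1999) is taken from the example.

```python
import re


def extract_with_priority(methods, rows, text):
    """Try extraction methods in priority order and return the first hit.

    methods: (priority, kind, spec) tuples, where kind is "heading" (spec is a
             set of headings) or "text" (spec is a regular expression).
    Assumption for this sketch: a smaller priority number is tried first.
    """
    for _, kind, spec in sorted(methods, key=lambda m: m[0]):
        if kind == "heading":
            for heading, value in rows:
                if heading in spec:
                    return value
        elif kind == "text":
            found = re.search(spec, text)
            if found:
                return found.group(0)
    return None


# Hypothetical methods for the "date" item of the "event" type.
date_methods = [
    (1, "heading", {"Date", "Held on"}),
    (2, "text", r"\d{4}年\d{1,2}月\d{1,2}日"),
]
rows = [("Name", "Autumn Fair"), ("Venue", "Tokyo")]   # no date heading present
text = "秋のフェアは1999年10月10日に開催します。"
event_date = extract_with_priority(date_methods, rows, text)
```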

[0061] FIG. 10 shows an example of the information extracted from each page. It lists the page name, the classification result, the information name, and the extracted content.

[0062] Not all of the information defined in the extraction information definition means 32 can necessarily be extracted. For example, "name" and "date" may be extracted from a document classified as "event" while "venue" is not.

[0063] Next, the information classification control means 4 selects the information classification means 5 that corresponds to the classification result of each page. Information extracted from pages classified as "job information", "event", and "present" is classified by the information classification means 5a, 5b, and 5c, respectively.

[0064] FIG. 11 outlines the classification methods described in the classification knowledge storage means 52. For each item of information, it describes the kind of classification and the method. Parts (a), (b), and (c) of FIG. 11 correspond to the classification knowledge storage means 52a, 52b, and 52c, respectively.

[0065] FIG. 12 shows the classification results. In FIG. 12, the classification of an "event" date is expressed as a six-digit number giving the year and month. The "present" deadline is classified by week, so it is expressed as the date of the Sunday of the corresponding week.
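
Both date classifications can be computed with the standard `datetime` module. This sketch assumes that a week runs from Sunday to Saturday, which the description does not state explicitly; the function names are also invented.

```python
from datetime import date, timedelta


def month_class(d: date) -> str:
    """Six-digit year-and-month class, e.g. 199910."""
    return f"{d.year:04d}{d.month:02d}"


def week_class(d: date) -> date:
    """Date of the Sunday that begins the week containing d.

    Assumption for this sketch: weeks run from Sunday to Saturday.
    date.weekday() returns Monday=0 ... Sunday=6.
    """
    days_since_sunday = (d.weekday() + 1) % 7
    return d - timedelta(days=days_since_sunday)
```

For instance, October 10, 1999 was a Sunday, so every deadline from October 10 through October 16, 1999 falls into the class represented by that date.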

[0066] The classification execution means 51 classifies information according to the descriptions in the classification knowledge storage means 52. Available classification methods include using a list of the words that belong to each class, using a thesaurus that lists municipality and facility names for each prefecture, and using a word thesaurus to group items under a broader concept.
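
A prefecture thesaurus of this kind can be sketched as a small mapping that is inverted for lookup. The entries below are a tiny hypothetical excerpt, and substring matching is an assumption made to keep the sketch short.

```python
# Hypothetical excerpt of a thesaurus: prefecture -> municipality names.
THESAURUS = {
    "Kanagawa": ["Kawasaki", "Yokohama"],
    "Tokyo": ["Shinjuku", "Shibuya"],
}


def build_index(thesaurus):
    """Invert the thesaurus into a municipality -> prefecture lookup."""
    return {name: prefecture
            for prefecture, names in thesaurus.items()
            for name in names}


def classify_place(value, index):
    """Return the prefecture of the first known municipality found in value."""
    for name, prefecture in index.items():
        if name in value:
            return prefecture
    return None


INDEX = build_index(THESAURUS)
```

With this index, a value such as "Kawasaki City" extracted as a work location would be grouped under "Kanagawa".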

[0067] Because a classification method is specified for each type of information, information of the same kind can be classified differently, as with the "date" of the "event" type and the "application deadline" of the "present" type in FIG. 11, both of which are dates.

[0068] Finally, the result output means 6 outputs the extracted information and the classification results.

[0069] As output methods, the extracted information can be written to a storage device in a format that can be loaded into a search system, database system, or similar system, such as CSV. Alternatively, a displayable format such as a list like that in FIG. 12 or HTML, or another structured document format such as XML or SGML, can be used.
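
Output in CSV form can be sketched with Python's standard `csv` module. The column names and the sample row are assumptions for this sketch, since the exact layout of FIG. 12 is not reproduced in the text.

```python
import csv
import io


def write_results_csv(rows, stream):
    """Write a header line followed by one line per extracted item.

    rows: (page, document class, item name, extracted value, information class)
    """
    writer = csv.writer(stream)
    writer.writerow(["page", "document_class", "item", "value", "info_class"])
    writer.writerows(rows)


# Hypothetical result row written to an in-memory buffer.
buffer = io.StringIO()
write_results_csv(
    [("y.html", "job information", "work location", "Kawasaki City", "Kanagawa")],
    buffer,
)
lines = buffer.getvalue().splitlines()
```

Writing to a real file works the same way; in that case the file should be opened with `newline=""`, as the `csv` module documentation recommends.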

[0070] This example is not limited to HTML; it works in the same way for other structured documents such as SGML and XML.

[0071] [Another Example] Next, the operation of this embodiment is described using another example.

[0072] Here, the case of extracting information from newspaper articles is described.

[0073] First, the newspaper articles are classified into predetermined categories. A conventional system that classifies documents on the basis of the words contained in the articles (for example, CB Classifier (trademark) from Justsystem) can be used. For example, articles are classified into "international politics", "new product information", and "sports".

[0074] Next, the information extraction control means 2 selects the information extraction means 3 that corresponds to the classification of each article, and information is extracted.

[0075] FIG. 13 shows examples of the types of information stored by the extraction information definition means 32. For example, "place name" and "persons concerned" are extracted from articles classified as "international politics" (13a in FIG. 13).

[0076] FIG. 14 shows examples of the information extraction methods stored in the extraction knowledge storage means 33. Because newspaper articles are unstructured text, information is extracted by pattern matching within the text. The "persons concerned" extracted from "international politics" articles are limited, by means of titles, to persons at the level of national representatives (FIG. 14(a)).
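
Restricting extracted persons by title can be sketched with a regular expression built from a title list. The titles, the English word order (title before name), and the placeholder names are all assumptions; the patterns of FIG. 14 are not reproduced in the text.

```python
import re

# Hypothetical list of titles that mark a national representative.
TITLES = ["President", "Prime Minister"]

# A title followed by one capitalized word, which is taken as the name.
TITLE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in TITLES) + r")\s+([A-Z][a-z]+)"
)


def extract_representatives(text):
    """Return the names that directly follow one of the listed titles."""
    return [name for _title, name in TITLE_PATTERN.findall(text)]


# Placeholder names; "Mr. Gamma" has no listed title and is not extracted.
sample = "President Alpha met Prime Minister Beta. Mr. Gamma also attended."
people = extract_representatives(sample)
```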

[0077] In FIG. 14(a) and (c), for "place name" and "sport name", lists are prepared in advance and the text is searched for their entries.

[0078] FIG. 15 shows an example of the information extracted from newspaper articles. It lists the article number, the classification result, the information name, and the extracted content.

[0079] Next, the information classification control means 4 selects the information classification means 5 that corresponds to the classification result of each article, and the information extracted from the article is classified. FIG. 16 outlines the classification methods described in the classification knowledge storage means 52; for each item of information, it describes the kind of classification and the method. FIG. 17 shows the classification results.

【００８０】最後に、結果出力手段6によって、抽出し
た情報と分類結果を出力する。Finally, the result output means 6 outputs the extracted information and the classification result.

【００８１】（第２の実施の形態）次に、本発明の第２
の実施の形態について図面を参照して詳細に説明する。(Second Embodiment) Next, a second embodiment of the present invention will be described in detail with reference to the drawings.

【００８２】図１８を参照すると、本発明の第２の実施
の形態は、文書分類手段1と、情報抽出制御手段2と、複
数の情報抽出手段3と、情報分類制御手段4と、複数の情
報分類手段5と、結果選択手段7と、結果出力手段6から
構成されている。Referring to FIG. 18, the second embodiment of the present invention comprises document classification means 1, information extraction control means 2, a plurality of information extraction means 3, information classification control means 4, a plurality of information classification means 5, result selection means 7, and result output means 6.

【００８３】次に、図１８および図１９のフローチャー
トを参照して本実施の形態の全体の動作について詳細に
説明する。Next, the overall operation of this embodiment will be described in detail with reference to FIG. 18 and the flowchart of FIG. 19.

【００８４】文書分類手段1、情報抽出制御手段2、情報
抽出手段3、情報分類制御手段4、情報分類手段5の動作
(ステップ1901〜1905)は第１の形態と同じである。The operations of the document classification means 1, information extraction control means 2, information extraction means 3, information classification control means 4, and information classification means 5 (steps 1901 to 1905) are the same as in the first embodiment.

【００８５】結果選択手段7は、抽出・分類された情報
のうち、特定の情報のみを選択して結果出力手段186に
渡す(ステップ1906)。選択基準としては、・文書の分類を指定する・情報の分類を指定する・抽出した情報に条件を指定する・特定の情報が抽出できた文書からの情報のみを選択す
るなどがある。The result selecting means 7 selects only specific information from the extracted and classified information and passes it to the result output means 186 (step 1906). The selection criteria include: specifying a document classification; specifying an information classification; specifying a condition on the extracted information; and selecting only information from documents from which specific information could be extracted.

【００８６】結果出力手段186は、結果選択手段7の出力
を受け取り、情報の分類結果に応じて複数に分割して出
力する(ステップ1907)。分割方法としては、・文書の分類ごとに分割する・情報の分類ごとに分割する・複数の分類の組み合わせごとに分割するなどがある。The result output means 186 receives the output of the result selection means 7, divides it into a plurality of parts according to the information classification results, and outputs them (step 1907). The division methods include: dividing by document classification; dividing by information classification; and dividing by combination of a plurality of classifications.

【００８７】次に、本実施の形態の効果について説明す
る。Next, effects of the present embodiment will be described.

【００８８】本実施の形態では、抽出あるいは分類した
情報から結果選択手段7によって、必要なものだけを選
択して出力するというように構成されているため、特定
の目的やユーザにあわせた情報抽出結果を提供できる。In this embodiment, the result selecting means 7 selects and outputs only the necessary items from the extracted or classified information, so information extraction results tailored to a specific purpose or user can be provided.

【００８９】また、本実施の形態では、さらに、結果出
力手段186によって、抽出した情報を分類して分割して
出力するというように構成されている。文書の分類と抽
出した情報の分類を組み合わせることで、特定の情報を
持つ文書を他と区別することができ、特定の目的を持っ
て文書を探す場合に容易に目的を達成できる。Further, in the present embodiment, the result output means 186 is configured to classify the extracted information, divide it, and output it. By combining the classification of the document and the classification of the extracted information, a document having specific information can be distinguished from others, and the purpose can be easily achieved when searching for a document with a specific purpose.

【００９０】［実施例］次に、具体的な実施例を用いて
本実施の形態の動作を説明する。[Example] Next, the operation of this embodiment will be described using a specific example.

【００９１】ここでは、Webページ(HTMLファイル)から
抽出した情報を例に説明する。Here, information extracted from a Web page (HTML file) will be described as an example.

【００９２】情報分類制御手段4および情報分類手段5に
よる分類結果として図７に示す形式の情報が得られる。As a classification result by the information classification control means 4 and the information classification means 5, information in the format shown in FIG. 7 is obtained.

【００９３】次に、結果選択手段7は結果から特定の情
報を選択する。選択方法の例として、・「イベント」ページから抽出された情報を選択する・「勤務地」が関東である「求人情報」から抽出された
情報を選択する・「開催日」が現在の日時より先である「イベント」ペ
ージから抽出された情報を選択する・「賞品」と「応募締め切り」の両方が抽出できた「プ
レゼント」ページから抽出された情報を選択するなどが
ある。Next, the result selecting means 7 selects specific information from the results. Examples of the selection method include: selecting information extracted from "event" pages; selecting information extracted from "job information" whose "work location" is Kanto; selecting information extracted from "event" pages whose "date" is later than the current date and time; and selecting information extracted from "present" pages from which both "prize" and "application deadline" could be extracted.
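The selection examples above can be expressed as one filter with optional criteria. The following is a minimal sketch under stated assumptions: the record layout (a page identifier, a page category, and a dictionary of extracted information), the field names, and the sample data are hypothetical and are not defined by the source.

```python
from datetime import date

# Hypothetical record layout: one entry per page, holding the page
# classification and the information extracted from that page.
RECORDS = [
    {"page": "e1", "category": "event", "info": {"date": date(2000, 5, 1)}},
    {"page": "e2", "category": "event", "info": {"date": date(1999, 1, 10)}},
    {"page": "j1", "category": "job", "info": {"work_region": "Kanto"}},
    {"page": "j2", "category": "job", "info": {"work_region": "Kansai"}},
    {"page": "p1", "category": "present",
     "info": {"prize": "camera", "deadline": date(2000, 3, 31)}},
    {"page": "p2", "category": "present", "info": {"prize": "watch"}},
]

def select(records, category=None, required=(), condition=None):
    """Keep the records that satisfy every criterion that was supplied."""
    selected = []
    for record in records:
        info = record["info"]
        if category is not None and record["category"] != category:
            continue  # wrong page classification
        if any(name not in info for name in required):
            continue  # a required piece of information was not extracted
        if condition is not None and not condition(info):
            continue  # condition on the extracted values failed
        selected.append(record)
    return selected

def pages(records):
    """Return only the page identifiers of the given records."""
    return [record["page"] for record in records]

today = date(2000, 1, 1)  # fixed here so the example is reproducible
events = select(RECORDS, category="event")
kanto_jobs = select(RECORDS, category="job",
                    condition=lambda info: info.get("work_region") == "Kanto")
upcoming = select(RECORDS, category="event", required=("date",),
                  condition=lambda info: info["date"] > today)
complete_presents = select(RECORDS, category="present",
                           required=("prize", "deadline"))
```

Each of the four calls corresponds to one of the selection examples in the paragraph; combining the criteria in a single function is a design choice of this sketch, not something the source prescribes.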

【００９４】最後に、結果出力手段186は、結果選択手
段7の出力を受け取り、情報の分類結果に応じて複数に
分割して出力する。分割方法としては、・ページの分類ごとに分割する・「イベント」ページの情報を「開催地」の都道府県別
に分割する・「イベント」ページを「開催日」の月別に分割する・「求人情報」ページの情報を、「勤務地」の都道府県
別分類と「職種」の分類組み合わせで分割する。この場
合、（47都道府県×職種分類の数）通りに分割すること
になるなどがある。Finally, the result output means 186 receives the output of the result selection means 7, divides it into a plurality of parts according to the information classification results, and outputs them. The division methods include: dividing by page classification; dividing the information of "event" pages by the prefecture of the "venue"; dividing "event" pages by the month of the "date"; and dividing the information of "job information" pages by the combination of the prefecture classification of "work location" and the classification of "occupation", in which case the output is divided into (47 prefectures × the number of occupation classifications) groups.
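Dividing the output by one classification or by a combination of classifications amounts to grouping records by a composite key. The sketch below assumes a hypothetical flat record layout with `prefecture` and `occupation` fields; the names and sample data are illustrative only.

```python
from collections import defaultdict

def split_by(records, *key_funcs):
    """Group records by the tuple of keys produced by key_funcs."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(func(record) for func in key_funcs)
        groups[key].append(record)
    return dict(groups)

# Hypothetical job records with already-classified information.
JOBS = [
    {"page": "j1", "prefecture": "Tokyo", "occupation": "engineer"},
    {"page": "j2", "prefecture": "Tokyo", "occupation": "sales"},
    {"page": "j3", "prefecture": "Osaka", "occupation": "engineer"},
    {"page": "j4", "prefecture": "Tokyo", "occupation": "engineer"},
]

# Division by a single classification.
by_prefecture = split_by(JOBS, lambda r: r["prefecture"])

# Division by the combination of two classifications.
by_combination = split_by(
    JOBS, lambda r: r["prefecture"], lambda r: r["occupation"]
)
```

Note that this sketch creates only the groups that actually contain records, whereas the text counts every possible combination (47 prefectures × the number of occupation classifications).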

【００９５】さらに、ページの分類と抽出された情報の
分類を階層的に組み合わせた出力形式を使用すること
で、特定の情報を含んでいる文書を効率よく見つけ出す
ことができる。階層構造の実現方法としては、たとえ
ば、HTMLやXMLのハイパーテキストの機能を使うことで
実現できる。Further, by using an output format in which page classification and extracted information classification are hierarchically combined, it is possible to efficiently find a document containing specific information. As a method of realizing the hierarchical structure, for example, it can be realized by using a hypertext function of HTML or XML.

【００９６】図２０は、結果出力手段186の出力結果の
例を示す図である。この例では、文書の分類結果の一覧
(2001)から、各分類の文書から抽出された情報の分類に
リンクがはられている。さらに、情報の分類から個々の
文書名の一覧を参照できる。FIG. 20 is a diagram showing an example of the output of the result output means 186. In this example, the list of document classification results (2001) links to the classifications of the information extracted from the documents of each classification. From each information classification, a list of individual document names can in turn be referenced.

【００９７】たとえば、「求人情報」からのリンクの一
つとして「勤務地」と「職種」の組み合わせの一覧(200
2)がリンクし、さらに該当する求人情報の一覧(2003)に
リンクしている。「イベント」については、「開催日」
による分類(2004)からイベント一覧(2005)にリンクして
いる。For example, one of the links from "job information" leads to a list of combinations of "work location" and "occupation" (2002), which in turn links to the list of matching job postings (2003). For "event", the classification by "date" (2004) links to the event list (2005).
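A hierarchical output of this kind can be sketched by rendering a nested structure as HTML. For brevity the sketch below nests the three levels in a single page as nested lists, rather than as separate linked lists like 2001 to 2005 in FIG. 20; the tree layout, labels, and file names are assumptions.

```python
from html import escape

def render(tree):
    """Render a nested dict (classification levels) ending in a list of
    (title, url) pairs (documents) as nested HTML lists."""
    if isinstance(tree, list):
        items = "".join(
            '<li><a href="%s">%s</a></li>' % (escape(url, quote=True), escape(title))
            for title, url in tree
        )
    else:
        items = "".join(
            "<li>%s%s</li>" % (escape(label), render(subtree))
            for label, subtree in tree.items()
        )
    return "<ul>%s</ul>" % items

# Hypothetical hierarchy: page classification -> information
# classification -> documents.
TREE = {
    "job information": {
        "Tokyo / engineer": [("Job A", "a.html")],
    },
}
html_output = render(TREE)
```

A reader starts from the page classification, descends to the information classification, and reaches the matching documents, which is the navigation path the paragraph describes.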

【００９８】このような出力形式を用いることで、出力
結果から目的とする文書を見つけ出すことが容易にな
る。By using such an output format, it is easy to find a target document from the output result.

【００９９】（第３の実施の形態）次に、本発明の第３
の実施の形態について図面を参照して詳細に説明する。(Third Embodiment) Next, a third embodiment of the present invention will be described in detail with reference to the drawings.

【０１００】図２１を参照すると、本発明の第３の実施
の形態は、文書分類手段2101と、情報抽出制御手段2
と、情報抽出手段3と、情報分類制御手段4と、情報分類
手段5と、結果選択手段7と、結果出力手段6から構成さ
れている。Referring to FIG. 21, the third embodiment of the present invention comprises document classification means 2101, information extraction control means 2, information extraction means 3, information classification control means 4, information classification means 5, result selection means 7, and result output means 6.

【０１０１】次に、図１８を参照して本実施の形態の全
体の動作について詳細に説明するが、情報抽出制御手段
2、情報抽出手段3、情報分類制御手段4、情報分類手段
5、結果出力手段6の動作は第１の実施の形態と同じであ
るため説明を省略する。Next, the overall operation of this embodiment will be described in detail with reference to FIG. 18. The operations of the information extraction control means 2, information extraction means 3, information classification control means 4, information classification means 5, and result output means 6 are the same as in the first embodiment, so their description is omitted.

【０１０２】本実施の形態の文書分類手段2101は、文書
を分類して特定のカテゴリーに含まれる文書のみを情報
抽出制御手段2に渡す。The document classification means 2101 of this embodiment classifies documents and passes only the documents that fall into a specific category to the information extraction control means 2.
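Passing only documents of a specific category to the extraction stage can be sketched as a classify-then-filter step. The keyword-count classifier below is a stand-in chosen for brevity and is an assumption; the source does not specify the classification method at this point, and the keyword lists and sample documents are hypothetical.

```python
# Hypothetical keyword lists standing in for the document classification knowledge.
CATEGORY_KEYWORDS = {
    "job": ["salary", "work location", "occupation"],
    "event": ["venue", "admission", "date"],
}

def classify(text):
    """Return the category whose keywords occur most often, or None."""
    lowered = text.lower()
    scores = {
        category: sum(lowered.count(keyword) for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def filter_documents(documents, wanted):
    """Pass on only the documents classified into a wanted category."""
    return [doc for doc in documents if classify(doc) in wanted]

DOCS = [
    "Occupation: engineer. Work location: Tokyo. Salary negotiable.",
    "Concert venue: city hall. Admission free.",
    "Weather report for the weekend.",
]
job_docs = filter_documents(DOCS, {"job"})
```

Only the first sample document reaches the extraction stage, so later processing never runs on documents outside the target category.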

【０１０３】次に、本実施の形態の効果について説明す
る。Next, effects of the present embodiment will be described.

【０１０４】本実施の形態では、特定のカテゴリーに分
類される文書のみを対象にして情報抽出処理を行うた
め、必要な情報を効率よく抽出することができる。In the present embodiment, since information extraction processing is performed only on documents classified into a specific category, necessary information can be efficiently extracted.

【０１０５】なお、本実施の形態の特殊な場合として、
特定のカテゴリーが１種類だけの場合は、カテゴリーに
よる情報抽出手段、情報分類手段の選択が不要になるた
め情報抽出制御手段と情報分類制御手段を省略して構成
できる(図２２)。As a special case of this embodiment, when there is only one specific category, there is no need to select the information extraction means and the information classification means by category, so the information extraction control means and the information classification control means can be omitted (FIG. 22).

【０１０６】また、本実施の形態における情報抽出装置
をコンピュータによって実現するには、第１の実施の形
態であれば、文書分類手段１、情報抽出制御手段２、情
報抽出手段３、情報分類制御手段４、情報分類手段５、
結果出力手段６の各機能を実現するコンピュータプログ
ラムを作成し、そのコンピュータプログラムをＣＤ−Ｒ
ＯＭやフロッピー（登録商標）ディスクや半導体メモリ
に代表される記録媒体に記録しておき、コンピュータ側
では、このプログラムが記録された記録媒体を読み出す
ことにより、コンピュータに上記各機能を生成すれば、
本発明の情報検索装置をコンピュータによって実現する
ことができる。また、このコンピュータプログラムは、
例えばサーバ内の記録装置に記録されている形態でもか
まわなく、ネットワークを介してこのサーバ内に含まれ
るプログラムを提供する形態でもよい。Further, to realize the information extraction device of this embodiment by a computer, in the case of the first embodiment, a computer program implementing the functions of the document classification means 1, information extraction control means 2, information extraction means 3, information classification control means 4, information classification means 5, and result output means 6 is created and recorded on a recording medium such as a CD-ROM, a floppy (registered trademark) disk, or a semiconductor memory. The computer reads the recording medium on which the program is recorded and thereby provides each of the above functions, so that the information search device of the present invention is realized by a computer. The computer program may also be recorded in a storage device in a server, for example, and provided from the server via a network.

【０１０７】[0107]

【発明の効果】本発明の第１の効果は、情報を正確に抽
出できることにある。その理由は、文書を分類して、分
類結果に応じた情報抽出方法を定義するためである。A first effect of the present invention is that information can be extracted accurately. This is because documents are classified and an information extraction method is defined according to the classification result.

【０１０８】また、本発明の第２の効果は、必要な情報
だけを抽出できることにある。その理由は、文書を分類
して分類に応じて重要な情報だけを抽出するためであ
る。A second effect of the present invention is that only necessary information can be extracted. The reason is that documents are classified and only important information is extracted according to the classification.

【０１０９】さらに、本発明の第３の効果は、大量の文
書から情報を抽出できることにある。その理由は、正確
に情報を抽出できるため人手を使わず自動的に情報を抽
出できるためである。A third effect of the present invention is that information can be extracted from a large number of documents. The reason is that the information can be extracted accurately, so that the information can be automatically extracted without using any manual operation.

[Brief description of the drawings]

【図１】本発明の第１の実施の形態の構成を示すブロッ
ク図である。FIG. 1 is a block diagram showing a configuration of a first exemplary embodiment of the present invention.

【図２】第１の実施の形態の情報抽出手段3の構成を示
すブロック図である。FIG. 2 is a block diagram illustrating a configuration of an information extracting unit 3 according to the first embodiment.

【図３】第１の実施の形態の情報分類手段5の構成を示
すブロック図である。FIG. 3 is a block diagram illustrating a configuration of an information classification unit 5 according to the first embodiment.

【図４】第１の実施の形態の動作を示す流れ図である。FIG. 4 is a flowchart showing the operation of the first embodiment.

【図５】第１の実施例でWebページに対して適合度を計
算した例を示す図である。FIG. 5 is a diagram illustrating an example of calculating a degree of suitability for a Web page in the first embodiment.

【図６】第１の実施例の抽出情報定義手段32が格納する
情報の種類の例を示す図である。FIG. 6 is a diagram illustrating an example of types of information stored in an extraction information defining unit 32 according to the first embodiment.

【図７】第１の実施例の抽出知識格納手段33が格納す
る、情報抽出の方法の例を示す図である。FIG. 7 is a diagram illustrating an example of an information extraction method stored in an extracted knowledge storage unit 33 according to the first embodiment.

【図８】第１の実施例で「求人情報」のWebページの表
示イメージの例である。FIG. 8 is an example of a display image of a web page of “job information” in the first embodiment.

【図９】第１の実施例で「イベント」のWebページの表
示イメージの例である。FIG. 9 is an example of a display image of a web page of “event” in the first embodiment.

【図１０】第１の実施例の情報抽出手段3が抽出した情
報の例を示す図である。FIG. 10 is a diagram showing an example of information extracted by the information extracting means 3 of the first embodiment.

【図１１】第１の実施例の分類知識格納手段52に記述す
る分類方法の概要を示す図である。FIG. 11 is a diagram showing an outline of a classification method described in the classification knowledge storage means 52 of the first embodiment.

【図１２】第１の実施例の分類結果を表す図である。FIG. 12 is a diagram illustrating a classification result of the first embodiment.

【図１３】第２の実施例の抽出情報定義手段32が格納す
る情報の種類の例を示す図である。FIG. 13 is a diagram illustrating an example of types of information stored by an extraction information defining unit 32 according to the second embodiment.

【図１４】第２の実施例の抽出知識格納手段33が格納す
る、情報抽出の方法の例を示す図である。FIG. 14 is a diagram illustrating an example of an information extraction method stored in an extracted knowledge storage unit 33 according to the second embodiment.

【図１５】第２の実施例の情報抽出手段3が抽出した情
報の例を示す図である。FIG. 15 is a diagram illustrating an example of information extracted by the information extracting unit 3 of the second embodiment.

【図１６】第２の実施例の分類知識格納手段52に記述す
る分類方法の概要を示す図である。FIG. 16 is a diagram showing an outline of a classification method described in the classification knowledge storage means 52 of the second embodiment.

【図１７】第２の実施例の分類結果を表す図である。FIG. 17 is a diagram illustrating a classification result of the second embodiment.

【図１８】本発明の第２の実施の形態の構成を示すブロ
ック図である。FIG. 18 is a block diagram showing a configuration of a second exemplary embodiment of the present invention.

【図１９】第２の実施の形態の動作を示す流れ図であ
る。FIG. 19 is a flowchart showing the operation of the second embodiment.

【図２０】第２の実施の形態の実施例の出力手段1806の
出力例を示す図である。FIG. 20 is a diagram illustrating an output example of an output unit 1806 according to the example of the second embodiment.

【図２１】本発明の第３の実施の形態の構成を示すブロ
ック図である。FIG. 21 is a block diagram showing a configuration of a third exemplary embodiment of the present invention.

【図２２】第３の実施の形態の別の構成を示すブロック
図である。FIG. 22 is a block diagram showing another configuration of the third embodiment.

【図２３】従来の文書検索装置の動作を示す流れ図であ
る。FIG. 23 is a flowchart showing the operation of a conventional document search device.

[Explanation of symbols]

1、2101 文書分類手段 2 情報抽出制御手段 3 情報抽出手段 31 抽出実行手段 32 抽出情報定義手段 33 抽出知識格納手段 4 情報分類制御手段 5 情報分類手段 51 分類実行手段 52 分類知識格納手段 6、186 結果出力手段 7 結果選択手段
1, 2101 Document classification means
2 Information extraction control means
3 Information extraction means
31 Extraction execution means
32 Extracted information definition means
33 Extracted knowledge storage means
4 Information classification control means
5 Information classification means
51 Classification execution means
52 Classification knowledge storage means
6, 186 Result output means
7 Result selection means

Claims

[Claims]

1. An information extraction system comprising at least: document classifying means for classifying an input document; information extracting means for changing the information to be extracted from the document and the method of extracting it in accordance with the classification of the document; and information classifying means for classifying the information extracted by the information extracting means.

2. An information extraction system comprising: document classifying means for classifying individual documents included in an input document set into a plurality of categories; extraction information defining means defining the types of information to be extracted from documents belonging to a specific category; information extraction means for extracting, with reference to the extraction information defining means, the information defined therein from documents classified into the category; information extraction control means for selecting appropriate information extraction means according to the document classification result of the document classifying means and performing control for extracting information from a document; information classification means for classifying information extracted from documents belonging to a specific category; information classification control means for selecting appropriate information classification means according to the document classification result and performing control for classifying the information extracted from the document; and result output means for outputting the classification result of the document classifying means, the information extracted by the information extraction means, and the classification result of the information classification means.

3. The information extraction system according to claim 2, wherein said document classification means includes a type discrimination means for discriminating a document type as a classification method for structured documents.

4. The information extracting system according to claim 2, wherein said document classifying means has a function of extracting a document belonging to a specific category.

5. The information extraction system according to claim 2, wherein said result output means selects and outputs information belonging to a specific document classification or information classification.

6. The information extraction system according to claim 2, wherein the result output means outputs the information extracted by the information extraction means in a format having a hierarchical structure.

7. An information extraction method comprising at least: a document classification step of classifying an input document; an information extraction step of changing the information to be extracted from the document and the method of extracting it in accordance with the classification of the document performed in the document classification step; and an information classification step of classifying the information extracted in the information extraction step according to the classification of the document performed in the document classification step.

8. An information extraction method comprising: a document classification step of classifying individual documents included in an input document set into a plurality of categories; an extraction information definition step defining the types of information to be extracted from documents belonging to a specific category; an information extraction step of extracting, with reference to the extraction information definition step, the information defined therein from documents classified into the category; an information extraction control step of selecting an appropriate information extraction step according to the document classification result of the document classification step and performing control for extracting information from a document; an information classification step of classifying information extracted from documents belonging to a specific category; an information classification control step of selecting an appropriate information classification step according to the document classification result and performing control for classifying the information extracted from the document; and a result output step of outputting the classification result of the document classification step, the information extracted in the information extraction step, and the classification result of the information classification step.

9. The information extraction method according to claim 8, wherein said document classification step includes a type determination step of determining a document type as a classification method for structured documents.

10. The information extraction method according to claim 8, wherein said document classification step has a function of extracting a document belonging to a specific category.

11. The information extraction method according to claim 8, wherein in the result output step, information belonging to a specific document classification or information classification is selected and output.

12. The information extraction method according to claim 8, wherein the result output step outputs the information extracted in the information extraction step in a format having a hierarchical structure.

13. A recording medium on which is recorded an information extraction program for causing a computer to execute at least: a document classification step of classifying an input document; an information extraction step of changing the information to be extracted from the document and the method of extracting it in accordance with the classification of the document performed in the document classification step; and an information classification step of classifying the information extracted in the information extraction step according to the classification of the document performed in the document classification step.

14. A recording medium on which is recorded an information extraction program for causing a computer to execute: a document classification step of classifying individual documents included in an input document set into a plurality of categories; an extraction information definition step defining the types of information to be extracted from documents belonging to a specific category; an information extraction step of extracting, with reference to the extraction information definition step, the information defined therein from documents classified into the category; an information extraction control step of selecting an appropriate information extraction step according to the document classification result of the document classification step and performing control for extracting information from a document; an information classification step of classifying information extracted from documents belonging to a specific category; an information classification control step of selecting an appropriate information classification step according to the document classification result and performing control for classifying the information extracted from the document; and a result output step of outputting the classification result of the document classification step, the information extracted in the information extraction step, and the classification result of the information classification step.

15. The recording medium on which the information extraction program according to claim 14 is recorded, wherein the document classification step includes a type determination step of determining a document type as a classification method for structured documents.

16. The recording medium according to claim 14, wherein said document classification step has a function of extracting a document belonging to a specific category.

17. The recording medium on which the information extraction program according to claim 14 is recorded, wherein the result output step selects and outputs information belonging to a specific document classification or information classification.

18. The recording medium on which the information extraction program according to claim 14 is recorded, wherein the result output step outputs the information extracted in the information extraction step in a format having a hierarchical structure.