JP2003303199A

JP2003303199A - Knowledge information collecting system and knowledge information collecting method

Info

Publication number: JP2003303199A
Application number: JP2002108416A
Authority: JP
Inventors: Kazuhiko Atsumi; 一彦渥美; Masayo Toyoda; 真代豊田; Koji Shioda; 弘二塩田
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2002-04-10
Filing date: 2002-04-10
Publication date: 2003-10-24
Anticipated expiration: 2022-04-10
Also published as: JP3725835B2

Abstract

PROBLEM TO BE SOLVED: To provide a mechanism for listing the collection and registration processing status of each document collection process.

SOLUTION: When each document collection process starts, a Web collection module 111 creates a status information file for managing the progress of that process, and writes status information indicating the progress of the document collection process into the corresponding status information file. As a started registration process advances, a registration module 12 writes status information indicating the progress of the registration into the status information file corresponding to the document collection process being registered. Based on the status information held in each status information file, a status list screen showing the current progress of each document collection process is displayed through a management interface 112.

COPYRIGHT: (C)2004,JPO

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a knowledge information collecting system and a knowledge information collecting method used in a knowledge management system, and in particular to a knowledge information collecting system and method for collecting, from a network, document information to be registered in a knowledge database.

[0002]

2. Description of the Related Art

In recent years, groupware for sharing information among multiple users has been introduced, mainly in companies. E-mail systems and workflow systems are well-known examples of groupware, and more recently knowledge management systems that support the sharing of knowledge information have also begun to be developed.

[0003] This knowledge management system accumulates and manages knowledge information, such as individual know-how, in a knowledge database. Combined with a search function such as natural language search, it allows the accumulated knowledge information to be used efficiently.

[0004] In such a knowledge management system, how efficiently knowledge information can be collected and accumulated is a key point. Web information on the Internet in particular is a treasure trove of knowledge, so a mechanism for efficiently collecting the necessary information from the Internet is needed.

[0005]

Problems to Be Solved by the Invention

However, collecting Web information from the Internet requires sequentially collecting a large number of mutually related document files while following link information, so the collection process usually takes a long time. Furthermore, for document information collected from the Internet to be usable as knowledge, it must be registered in the knowledge database of the knowledge management system, so the registration process also takes time.

[0006] For this reason, particularly when multiple collection processes with different collection starting points are configured and their results are registered in the knowledge database, progress differs greatly from one configured collection process to another, and it becomes difficult to tell which collection process is in which collection/registration state. The same problem arises when Web information is collected at the same time as information from other sources, such as databases.

[0007] The present invention has been made in view of these circumstances, and its object is to provide a knowledge information collecting system and a knowledge information collecting method that make it easy to grasp the collection/registration processing status of each of a plurality of document collection processes.

[0008]

Means for Solving the Problems

To solve the above problems, the present invention provides a knowledge information collecting system that collects, from a network, document information to be registered in a knowledge database, the system comprising: document collection means for executing a plurality of document collection processes, each corresponding to one of a plurality of pieces of setting information that each indicate conditions for collecting a group of document files from the network, the document collection means creating, at the start of each document collection process to be processed, a status information file for managing the progress of that document collection process from the start of collection through registration, and writing status information indicating the progress of the document collection process into the corresponding status information file as the process advances; registration means for accepting a registration request issued by the document collection means each time a document collection process completes, and for sequentially executing, in the order in which the registration requests are accepted, registration processes that register in the knowledge database the result of the document collection process named in each request, the registration means writing status information indicating the progress of a started registration process into the status information file corresponding to the document collection process being registered, as that registration process advances; and status display means for displaying a status list screen showing the current progress of each document collection process, based on the status information held in the status information files corresponding to the document collection processes that have been started.

[0009] In this knowledge information collecting system, the document collection means creates, at the start of each document collection process, a status information file for managing the progress of that process from the start of collection through registration, and writes status information indicating the progress of the document collection process into the corresponding status information file as the process advances. Each time a document collection process completes, the document collection means issues a registration request, and the registration means executes registration processes in the order in which the requests are accepted. As a started registration process advances, the registration means writes status information indicating the progress of that registration process into the status information file corresponding to the document collection process being registered. Because one status information file is created per document collection process, and both the collection progress and the registration progress are written into that same file, the collection/registration status of every started document collection process can be managed individually. The status display means then displays a status list screen showing the current progress of each document collection process, based on the status information held in the corresponding status information files. The collection/registration processing status of each of a plurality of document collection processes can therefore be grasped easily.
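
The shared status file described here can be sketched in Python. This is only an illustration under stated assumptions: the JSON layout, the `.status.json` file naming, and the names `write_status` and `status_list` are hypothetical, since the text does not specify the format of the status information file.

```python
import json
import tempfile
from pathlib import Path


def write_status(status_dir: Path, task_id: str, phase: str, state: str) -> None:
    """Create or update the single status file that belongs to one collection task."""
    path = status_dir / f"{task_id}.status.json"  # hypothetical naming scheme
    record = json.loads(path.read_text()) if path.exists() else {"task": task_id}
    record[phase] = state  # phase is "collection" or "registration" (assumed labels)
    path.write_text(json.dumps(record))


def status_list(status_dir: Path) -> list[dict]:
    """Read every status file to build the rows of a status list screen."""
    return [json.loads(p.read_text()) for p in sorted(status_dir.glob("*.status.json"))]


with tempfile.TemporaryDirectory() as tmp:
    status_dir = Path(tmp)
    write_status(status_dir, "task-a", "collection", "running")    # collector, at start
    write_status(status_dir, "task-b", "collection", "running")    # a second task starts
    write_status(status_dir, "task-a", "collection", "completed")  # collector, at completion
    write_status(status_dir, "task-a", "registration", "running")  # registration side
    rows = status_list(status_dir)

for row in rows:
    print(row)
```

Because the collecting side and the registering side write to the same per-task file, the list view only has to read one file per task to show the whole path from collection start through registration.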

[0010] The present invention also provides a knowledge information collecting system that collects document information to be registered in a knowledge database, the system comprising: a plurality of document collection means provided for respective ones of a plurality of mutually different types of information sources, each executing a document collection process for collecting a group of document files from its corresponding information source, each document collection means creating, at the start of a document collection process, a status information file for managing the progress of that process from the start of collection through registration, and writing status information indicating the progress of the document collection process into the corresponding status information file as the process advances; registration means for accepting a registration request that each of the plurality of document collection means issues each time it completes a document collection process, and for sequentially executing, in the order in which the registration requests are accepted, registration processes that register in the knowledge database the result of the document collection process named in each request, the registration means writing status information indicating the progress of a started registration process into the status information file corresponding to the document collection process being registered, as that registration process advances; and status display means for displaying a status list screen showing the current progress, from the start of collection through registration, of the document collection process of each of the plurality of document collection means, based on the status information held in the status information files corresponding to the document collection processes started by the plurality of document collection means.

[0011] This knowledge information collecting system has a plurality of document collection means, each of which executes a document collection process targeting a different type of information source. Even in this case, by using a status information file created for each document collection process and writing both the collection progress and the registration progress of that process into the same file, the collection/registration status of each of the various collection processes, such as collection from a network, collection from a filing system, and collection from a database, can be managed.

[0012]

Embodiments of the Invention

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 shows the overall configuration of a knowledge management system that uses a knowledge information collecting system according to one embodiment of the present invention. This knowledge management system provides services such as the collection, analysis, and retrieval of knowledge information, and comprises a Web collection system 11, a registration module 12, a knowledge engine 13, and other components. The Web collection system 11, the registration module 12, and the knowledge engine 13 are implemented as programs executed on a server computer. The Web collection system 11 and the registration module 12 constitute a knowledge information collecting system for collecting the knowledge information used in the knowledge management system. This knowledge information collecting system collects documents such as Web pages from the Internet/intranet 30 and registers their contents in the knowledge database (knowledge DB) 131 of the knowledge management system. The Web collection system 11 comprises a Web collection module 111, a management interface 112, and a registration directory 113.

[0013] The Web collection module 111 is a program that collects various document files on the Internet/intranet 30 and outputs them in a format that can be registered in the knowledge DB 131. The Web collection module 111 has a multithreaded structure and can fetch document files from the Internet/intranet 30 in parallel using HTTP (HyperText Transfer Protocol). HTTP is a communication protocol for obtaining documents written in HTML (HyperText Markup Language) from the Web servers 31, which are the information sites (Web sites) on the Internet/intranet 30. The Web collection module 111 fetches from the Internet/intranet 30 the document file at the URL (Uniform Resource Locator) specified in the setting information that defines the Web collection conditions, and if the fetched document file contains link-destination URLs, it fetches the document files at those URLs as well. By repeating this recursively, it sequentially obtains a group of related document files from the Internet/intranet 30. Detecting link information (URLs) pointing to other document files requires analyzing each fetched document file, and this analysis not only detects link information but also extracts the text data to be registered in the knowledge DB 131.
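
A minimal Python sketch of analyzing one fetched HTML document so that link detection and text extraction happen in the same pass. The class name `LinkAndTextParser`, the function `analyze`, and the example URL are hypothetical; the sketch uses the standard library `html.parser` module and works on an in-memory string rather than a real HTTP fetch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects link URLs and visible text during a single parse (illustrative only)."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []
        self.text_parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Link detection: resolve each anchor's href against the document URL.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Text extraction: keep non-empty character data.
        if data.strip():
            self.text_parts.append(data.strip())


def analyze(base_url: str, html: str) -> tuple[list[str], str]:
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


links, text = analyze(
    "http://example.com/docs/index.html",
    '<html><body><h1>Manual</h1><a href="ch1.html">Chapter 1</a></body></html>',
)
print(links)
print(text)
```

The returned links would feed the next round of fetching, while the returned text is what would be handed on for registration.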

[0014] The document files to be collected include not only hypertext files (HTML files), which can contain URLs as link information to other files, but also text files (plain text) and document files in the various formats created by application programs (for example, Portable Document Format files, documents created with word processing software, files created with spreadsheet software, and presentation data files created with presentation software).

[0015] For each document file collected from the Internet/intranet 30, the Web collection module 111 obtains the attribute information to be registered in the knowledge DB 131 and extracts the text data described above. The attribute information obtained from each document file includes, for example, the URL of the document file and its file creation (update) date and time. The Web collection module 111 then creates a registration file describing the attribute information and text data of each document file, and outputs it to the registration directory 113. Here, a registration file is a file with a predetermined description format that serves as a common interface for registering in the knowledge DB 131 the document information of document files of many different file formats. The registration file is used to register the document information of each of these types of document files in the knowledge DB 131 in a common format. In this embodiment, XML (eXtensible Markup Language) is used for the registration file.
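
As an illustration of such a common XML registration file, the following Python sketch builds one from a list of collected documents. The element names (`registration`, `document`, `url`, `updated`, `text`) and the sample values are assumptions made for this example; the text states only that the file is XML and carries each document's attribute information and text data.

```python
import xml.etree.ElementTree as ET


def build_registration_file(documents: list[dict]) -> str:
    """Serialize attribute information and text data into one XML string."""
    root = ET.Element("registration")  # hypothetical root element name
    for doc in documents:
        entry = ET.SubElement(root, "document")
        ET.SubElement(entry, "url").text = doc["url"]          # attribute information
        ET.SubElement(entry, "updated").text = doc["updated"]  # attribute information
        ET.SubElement(entry, "text").text = doc["text"]        # extracted text data
    return ET.tostring(root, encoding="unicode")


xml_text = build_registration_file([
    {"url": "http://example.com/a.html", "updated": "2002-04-01T09:00:00", "text": "Page A"},
    {"url": "http://example.com/b.txt", "updated": "2002-04-02T10:30:00", "text": "Note B"},
])
parsed = ET.fromstring(xml_text)  # a registering program would read the file back like this
print(xml_text)
```

Whatever the original file format was, the registering side only needs to understand this one layout.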

[0016] The management interface 112 is a program for setting the contents of each Web collection process (hereinafter called a Web collection processing task) to be executed by the Web collection module 111. Its functions include setting and managing the Web collection conditions of each Web collection processing task, controlling the starting and stopping of the Web collection module 111, and managing and presenting the collection status of each Web collection processing task. The management interface 112 is implemented as an external program (CGI program) that the Web server 22 can launch through CGI (Common Gateway Interface), so that an administrator user can perform the necessary operations from the Web browser 21 on the user terminal.

[0017] As shown in the figure, the Web collection system 11 also includes a lock file 201, a control file 202, a setting file 203, a result file 204, and a log file 205 as files for managing and controlling the operation of the Web collection module 111.

[0018] The lock file 201 is an exclusive-control file that prevents the Web collection module 111 from being started twice. The control file 202 is a file that the management interface 112 uses to stop the Web collection module 111; it is used, for example, to interrupt a running Web collection processing task in response to an instruction from the administrator user. The administrator user can request an interruption during collection through the management interface 112.
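
One common way to realize a lock file and a stop-request file is sketched below in Python. The file names `collector.lock` and `collector.stop`, the use of an exclusive-create open, and the convention that the mere existence of the control file means "stop" are all assumptions of this sketch; the text does not describe the files' contents or how they are checked.

```python
import os
import tempfile
from pathlib import Path


def acquire_lock(lock_path: Path) -> bool:
    """Return True if this process created the lock file, False if it already existed."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another instance is already running
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    return True


def run_collector(work_dir: Path, urls: list[str]) -> list[str]:
    lock = work_dir / "collector.lock"     # hypothetical lock file name
    control = work_dir / "collector.stop"  # hypothetical control file name
    if not acquire_lock(lock):
        return []
    fetched = []
    try:
        for url in urls:
            if control.exists():   # stop requested by the management side
                break
            fetched.append(url)    # stand-in for actually fetching the document
    finally:
        lock.unlink()              # release the lock when the run ends
    return fetched


with tempfile.TemporaryDirectory() as tmp:
    work_dir = Path(tmp)
    first = acquire_lock(work_dir / "collector.lock")
    second = acquire_lock(work_dir / "collector.lock")  # simulated double start
    (work_dir / "collector.lock").unlink()
    full_run = run_collector(work_dir, ["u1", "u2"])
    (work_dir / "collector.stop").touch()               # an interruption is requested
    stopped_run = run_collector(work_dir, ["u1", "u2"])

print(first, second, full_run, stopped_run)
```

Checking the control file between documents lets a long-running task stop at a clean boundary instead of being killed mid-fetch.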

[0019] The setting file 203 is a file that describes, for each Web collection processing task, the setting information specifying its Web collection conditions, and it can hold multiple pieces of setting information corresponding to multiple Web collection processing tasks. The Web collection conditions of each Web collection processing task are set by the administrator user. The Web collection conditions include at least starting-point address information (a starting-point URL) indicating the location of the document file from which information collection from the Internet/intranet 30 should begin, and collection range information indicating the upper limit on the number of document files to collect or on the number of link levels to follow. The Web collection operation of the Web collection module 111 is controlled according to these Web collection conditions. That is, starting from the document file specified by the starting-point URL, the Web collection module 111 sequentially collects the related document files within the range defined by the collection range information.
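
The effect of the starting-point URL and the two kinds of upper limit can be illustrated with a bounded traversal. This Python sketch makes several assumptions: it walks an in-memory link graph (`SITE`) instead of fetching pages, it uses a breadth-first queue rather than the recursive procedure the text describes, and the names `CollectionSetting` and `collect` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class CollectionSetting:
    start_url: str                        # starting-point URL
    max_documents: Optional[int] = None   # upper limit on the number of document files
    max_link_depth: Optional[int] = None  # upper limit on the number of link levels


def collect(setting: CollectionSetting, link_graph: dict[str, list[str]]) -> list[str]:
    """Visit documents reachable from the start URL, honoring the configured limits."""
    collected: list[str] = []
    seen = {setting.start_url}
    queue = deque([(setting.start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if setting.max_documents is not None and len(collected) >= setting.max_documents:
            break
        collected.append(url)
        if setting.max_link_depth is not None and depth >= setting.max_link_depth:
            continue  # do not follow links beyond the allowed depth
        for link in link_graph.get(url, []):
            if link not in seen:  # avoid collecting the same document twice
                seen.add(link)
                queue.append((link, depth + 1))
    return collected


# Hypothetical site: each key links to the listed pages.
SITE = {"/": ["/a", "/b"], "/a": ["/a1", "/"], "/b": ["/b1"], "/a1": [], "/b1": []}
by_depth = collect(CollectionSetting("/", max_link_depth=1), SITE)
by_count = collect(CollectionSetting("/", max_documents=4), SITE)
print(by_depth)
print(by_count)
```

Tracking already-seen URLs matters because pages commonly link back to the starting point, as `/a` does here.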

[0020] The Web collection conditions also include a knowledge database name that specifies the destination knowledge database. In the knowledge DB 131, multiple knowledge databases holding different kinds of documents are defined, and each is managed under its own knowledge database name. By specifying the destination knowledge database name for each Web collection processing task in the setting information of the setting file 203, the collected document information can be registered in any knowledge database within the knowledge DB 131.

[0021] The collection range information can also specify the file formats to be collected, the collection conditions to apply when re-collecting, and so on. Here, re-collection refers to the second and subsequent Web collection runs when, for example, the Web collection module 111 executes the same Web collection processing task periodically, and to the Web collection run performed when an interrupted Web collection processing task is resumed.

[0022] The result file 204 is a file for managing, for each Web collection processing task, the results of Web collection, such as the list of previously collected document files. For each Web collection processing task, the list of URLs obtained by Web collection, the acquisition date and time, the number of document files obtained, and so on are output to this file. The result file 204 serves two purposes: presenting the collection status of each Web collection processing task to the user, and making re-collection more efficient. Using the result file 204, it is possible to detect document files that were collected in the past but have since been deleted from the information sites on the Internet/intranet 30, to find the restart point when a Web collection process was interrupted partway, and to detect document files that have been updated on the Internet/intranet 30 since the previous collection.
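
How a stored result could support re-collection is sketched below in Python. The representation is an assumption: each result is modeled as a mapping from URL to an update timestamp in a single ISO 8601 format (so that string comparison orders them correctly), and the function name `compare_results` and the sample data are hypothetical.

```python
def compare_results(
    previous: dict[str, str], current: dict[str, str]
) -> tuple[list[str], list[str], list[str]]:
    """Compare the previous result with what the site currently serves.

    Returns (deleted, updated, new) URL lists.
    """
    deleted = sorted(url for url in previous if url not in current)
    updated = sorted(
        url for url in current if url in previous and current[url] > previous[url]
    )
    new = sorted(url for url in current if url not in previous)
    return deleted, updated, new


# Hypothetical data: URL -> last update time recorded at collection.
previous = {
    "/a": "2002-04-01T00:00:00",
    "/b": "2002-04-01T00:00:00",
    "/c": "2002-04-01T00:00:00",
}
current = {
    "/a": "2002-04-01T00:00:00",  # unchanged, can be skipped on re-collection
    "/b": "2002-04-09T12:00:00",  # updated since the previous collection
    "/d": "2002-04-08T00:00:00",  # not seen before
}
deleted, updated, new = compare_results(previous, current)
print(deleted, updated, new)
```

Only the updated and new documents need to be fetched again, which is where the efficiency gain for re-collection would come from.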

[0023] The log file 205 is a file that records whether each document file acquisition by the Web collection module 111 succeeded or failed, the types of errors that occurred, and so on.

[0024] The registration directory 113 is a storage area to which the document contents to be registered in the knowledge DB 131 are output. The registration file described above, containing the attribute information and text data that the Web collection module 111 extracted from each document file, is output here. The Web collection module 111 has two processing modes for text data extraction, a first mode and a second mode.

[0025] In the first processing mode, document files of all file formats are analyzed to extract text data and to detect link information (URLs); URL detection applies only to HTML files. In the second processing mode, only HTML files and text files (plain text) are analyzed to extract text data and to detect link information (again, URL detection applies only to HTML files). Document files in other formats, such as Portable Document Format (hereinafter called content files), are not analyzed.

[0026] When the second processing mode is used, the text data and attribute information of HTML and plain-text files are written to the registration file, which is output to the registration directory 113. Files in other formats, such as Portable Document Format, are output to the registration directory 113 unchanged as content files, and the registration file records the attribute information of each such file and the path name of its content file. The essential point of the second processing mode is that text data is extracted as part of the analysis of those files that must be analyzed anyway to detect link information, so text data extraction may alternatively be limited to HTML files only.
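
The difference between the two processing modes comes down to a per-file decision, sketched here in Python. The names `Mode` and `plan_entry`, the format labels `"html"`, `"txt"`, and `"pdf"`, and the dictionary keys are hypothetical; the sketch only decides what would be recorded for a file and does not perform any real parsing.

```python
from enum import Enum


class Mode(Enum):
    FIRST = 1   # analyze every file format
    SECOND = 2  # analyze only HTML and plain text

ANALYZED_IN_SECOND_MODE = {"html", "txt"}  # assumed format labels


def plan_entry(mode: Mode, url: str, file_format: str, saved_path: str) -> dict:
    """Decide what the registration file entry for one collected file should carry."""
    analyze = mode is Mode.FIRST or file_format in ANALYZED_IN_SECOND_MODE
    # Link detection applies to HTML files only, in either mode.
    entry = {"url": url, "detect_links": analyze and file_format == "html"}
    if analyze:
        entry["extract_text"] = True             # text data goes into the registration file
    else:
        entry["content_file_path"] = saved_path  # file is passed through unchanged
    return entry


pdf_first = plan_entry(Mode.FIRST, "http://example.com/spec.pdf", "pdf", "registration/spec.pdf")
pdf_second = plan_entry(Mode.SECOND, "http://example.com/spec.pdf", "pdf", "registration/spec.pdf")
html_second = plan_entry(Mode.SECOND, "http://example.com/", "html", "registration/index.html")
print(pdf_first, pdf_second, html_second, sep="\n")
```

In the second mode the collecting side skips the costly conversion of formats such as PDF and leaves a path reference, so text extraction for those files can happen later on the registering side.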

[0027] In both the first and second processing modes, when processing of the running Web collection processing task completes, the Web collection module 111 issues a registration request file to the registration module 12 to request registration of the documents. The registration request file describes the file name of the registration file, the name of the destination knowledge database, and so on.

[0028] The registration module 12 registers in the knowledge DB 131 the attribute information and text data of each document file collected by the Web collection module 111. That is, in response to a registration request from the Web collection module 111, the registration module 12 obtains the corresponding registration file, extracts the attribute information and text data described in it for each document file, and registers them in the destination knowledge database of the knowledge DB 131. The registration module 12 also has a text data extraction function: if the registration file contains the path name of a content file, the registration module 12 extracts text data from the content file specified by that path name and registers it in the corresponding destination knowledge database of the knowledge DB 131.

【００２９】Further, the registration module 12 also executes registration processing for documents collected by collection modules other than the Web collection module 111. The other collection modules include, for example, a file collection module that collects document information from a file server of an electronic filing system, a database (RDB) collection module that collects document information managed on an RDB database server, a community collection module that collects document information posted to an electronic-bulletin-board-style community, and a user collection module that a user uses to convert document files of arbitrary application programs into the output format of the registration file (XML). The registration module 12 is shared by all of these collection modules. For every collection module, the XML-format registration file described above serves as the interface to the registration module 12. That is, the registration module 12 accepts the registration request that each collection module issues every time it completes a collection task and, in the order in which the requests were accepted, sequentially executes registration processing that registers the contents of the registration file produced by the requested document collection processing in the corresponding registration-destination knowledge database of the knowledge DB 131.
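The first-come, first-served handling of registration requests described above can be sketched as a simple queue. This is a minimal illustration only: the class name `RegistrationQueue`, the dictionary keys, and the `register` callback are hypothetical and are not taken from the specification.

```python
from collections import deque


class RegistrationQueue:
    """Accepts registration requests and processes them in arrival order."""

    def __init__(self, register):
        self._pending = deque()
        # Hypothetical callback that registers one registration file.
        self._register = register

    def submit(self, request):
        """Called by any collection module when a collection task completes."""
        self._pending.append(request)

    def run(self):
        """Process every pending request, oldest first; return the collectors served."""
        processed = []
        while self._pending:
            request = self._pending.popleft()
            self._register(request)
            processed.append(request["collector"])
        return processed
```

Because every collection module submits the same kind of request, the queue does not need to know which module produced it.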

【００３０】The knowledge engine 13 performs knowledge analysis processing for making use of the information accumulated in the knowledge DB 131. This processing includes analyzing the features of each of the large number of document information items accumulated in the knowledge DB 131 to extract important words, and clustering, which classifies and organizes the document information into a plurality of categories by feature. The knowledge engine 13 is provided with a knowledge engine 132 for performing natural language search; by accessing the knowledge engine 132 from the Web browser 41 of a user terminal through the Web server 42, each user can search the knowledge accumulated in the knowledge DB 131 from the Web browser 41. Because the attribute information of each document stored in the knowledge DB 131 includes the URL of that document, the original document can be acquired from the corresponding information site and displayed on the Web browser 41. A search can target only a knowledge database selected by name, or all knowledge databases in the knowledge DB 131.

【００３１】Next, the functional configuration of the Web collection module 111 will be described with reference to FIG. 2. As illustrated, the Web collection module 111 has a collection control unit 301, an attribute extraction unit 302, a text extraction unit 303, and a format conversion unit 304. For each item of setting information held in the setting file 203, the collection control unit 301 executes the Web collection task specified by that setting information, one task at a time in order. For each Web collection task, a group of related document files is collected sequentially from the Internet/intranet 30, starting from the starting URL specified in the setting information. URLs of linked documents contained in an acquired document file are added to the URL list 305, and the collection processing is executed recursively while URLs are taken from the URL list 305. The result file 204 described above can be used as the URL list 305. The collection range is limited by the collection conditions held in the setting file 203.
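The collection loop driven by the URL list can be sketched as follows. This is a minimal sketch under stated assumptions: `fetch`, `extract_links`, and `in_scope` are hypothetical callables supplied by the caller, and the URL list is drained iteratively with a work queue rather than by literal recursion.

```python
from collections import deque


def collect(start_url, fetch, extract_links, in_scope=lambda url: True):
    """Collect documents reachable from start_url.

    fetch(url) returns a document, or None if it cannot be acquired.
    extract_links(document) returns the link destination URLs it contains.
    in_scope(url) stands in for the collection conditions of the setting file.
    """
    url_list = deque([start_url])   # plays the role of the URL list 305
    seen = {start_url}              # avoids collecting the same URL twice
    collected = {}
    while url_list:
        url = url_list.popleft()
        document = fetch(url)
        if document is None:
            continue
        collected[url] = document
        for link in extract_links(document):
            if link not in seen and in_scope(link):
                seen.add(link)
                url_list.append(link)
    return collected
```

The `seen` set is an addition for the sketch; without some such check, pages that link to each other would be collected indefinitely.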

【００３２】Each collected document file is first sent to the attribute extraction unit 302, where its attribute information is acquired. Next, the text extraction unit 303 analyzes the document and extracts the text data to be registered in the knowledge DB 131 and the link destination URLs to be acquired next. For an HTML file, for example, text data is extracted by taking out the portions other than HTML tags. The extracted text is divided into a title and a body. Link destination URLs are obtained by acquiring the HREF values of A and AREF tags, the SRC values of FRAME, IFRAME, and LAYER tags, and the REFRESH value of META tags. When operating in the second processing mode described above, the processing of the text extraction unit 303 is performed only on HTML files and text files, and is not performed on document files of other formats such as Portable Document Format.
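As an illustration of this analysis step, the sketch below uses Python's standard `html.parser` module to separate title text, body text, and link destination URLs. It is not the implementation of the text extraction unit 303. The class and function names are hypothetical, and the sketch assumes that the "AREF" tag named in the text corresponds to the standard AREA element.

```python
from html.parser import HTMLParser


class LinkAndTextExtractor(HTMLParser):
    HREF_TAGS = {"a", "area"}                # assumption: "AREF" means AREA
    SRC_TAGS = {"frame", "iframe", "layer"}

    def __init__(self):
        super().__init__()
        self.links = []
        self.title_parts = []
        self.body_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)  # tag and attribute names arrive lowercased
        if tag == "title":
            self._in_title = True
        elif tag in self.HREF_TAGS and attr.get("href"):
            self.links.append(attr["href"])
        elif tag in self.SRC_TAGS and attr.get("src"):
            self.links.append(attr["src"])
        elif tag == "meta" and (attr.get("http-equiv") or "").lower() == "refresh":
            # content looks like "5; url=next.html"
            for part in (attr.get("content") or "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.strip().lower() == "url" and value:
                    self.links.append(value.strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Everything that is not a tag is text; script/style content is not filtered here.
        text = data.strip()
        if text:
            (self.title_parts if self._in_title else self.body_parts).append(text)


def extract(html):
    """Return (title, body, links) for one HTML document."""
    parser = LinkAndTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.body_parts), parser.links
```

Relative URLs are returned as written; a caller would resolve them against the URL of the containing document before adding them to the URL list.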

【００３３】The attribute information and text data acquired from each document file are sent to the format conversion unit 304, where they are shaped into an output format conforming to the description format of the XML registration file and output to the registration directory 113. One registration file describes the attribute information and text data of, for example, about 1000 document files. For a document file on which text extraction was not performed, the attribute information and the path name are described in the registration file.

【００３４】The above processing is performed in the same way on each document file collected by re-collection.

【００３５】Next, examples of the output format of the registration file will be described with reference to FIGS. 3 and 4.

【００３６】FIG. 3 shows an example of the output format used when the text extraction unit 303 performs text extraction. The tags <?xml version="1.0" encoding="Shift_JIS"?> and <KnowledgeSystem> at the head of the file indicate the beginning of the file, and the tag </KnowledgeSystem> at the tail indicates the end of the file.

【００３７】Each record enclosed by <RECORD> and </RECORD> describes the attribute information and text data of one document file. The tags within each record have the following meanings.

【００３８】MODE: Mode information that specifies the operation mode of the registration module 12. For each document file, it instructs registration or deletion of the document information (text data and attributes) in the knowledge DB 131. The value is either 2 (register, overwriting) or 0 (delete). In the case of deletion, none of the tags below other than TYPE and UNIQUE are output.

【００３９】TYPE: Indicates the type of collection. In this example it is always "Web collection".
UNIQUE: A unique key identifying the document file registered in the knowledge DB 131. Normally, the URL of the document is used as the unique key.

【００４０】ORGDATE: Indicates the creation date and time (or update date and time) of the document file.
TITLE: Indicates the title of the document file. The text data extracted from the title part of an HTML file becomes the title. No title is output for files other than HTML files. This title is used as the title of each document file displayed on the search screen.

【００４１】AUTHOR: Describes the host name (the host address in the URL) of the information site that owns the document file.
DATE: Describes the date part of ORGDATE above.
URL: The URL of the document file. It has the same value as UNIQUE.
BODY: Describes the text data extracted from the document file.
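To make the record layout concrete, the following sketch builds one record with the tags listed above using Python's `xml.etree.ElementTree`. Since FIG. 3 is not reproduced here, the sketch assumes that each tag is a child element of RECORD and that ORGDATE is a string of the form "YYYY-MM-DD hh:mm:ss"; both are assumptions, as are the function names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def build_record(url, orgdate, title, body):
    """Build a RECORD element for a registered (MODE=2) Web document."""
    record = ET.Element("RECORD")
    fields = [
        ("MODE", "2"),                             # 2 = register (overwrite)
        ("TYPE", "Web collection"),
        ("UNIQUE", url),                           # the URL serves as the unique key
        ("ORGDATE", orgdate),
        ("TITLE", title),                          # None for non-HTML files
        ("AUTHOR", urlsplit(url).hostname or ""),  # host name of the information site
        ("DATE", orgdate.split(" ")[0]),           # date part of ORGDATE (assumed format)
        ("URL", url),                              # same value as UNIQUE
        ("BODY", body),
    ]
    for tag, text in fields:
        if text is not None:
            ET.SubElement(record, tag).text = text
    return record


def build_registration_file(records):
    """Wrap records in the KnowledgeSystem root and serialize to a string."""
    root = ET.Element("KnowledgeSystem")
    root.extend(records)
    # The XML declaration with the Shift_JIS encoding is omitted in this sketch.
    return ET.tostring(root, encoding="unicode")
```

Serializing through a library rather than by string concatenation ensures that characters such as `<` and `&` in the body text are escaped.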

【００４２】FIG. 4 shows an example of the output format for a document file on which the text extraction unit 303 did not perform text extraction.

【００４３】No text data is described in BODY; instead, the path name of the content file output to the registration directory 113 is described in PATH1 of the area enclosed by <BDYFILE> and </BDYFILE>. DEL=1 instructs the registration module 12 to delete the original file from the registration directory 113 after extracting the text data from the content file. When the Web collection module 111 operates in the second processing mode, text data is described in BODY for HTML files and plain text, and the path name of the content file is described in BDYFILE for document files of other formats (content files).

【００４４】Next, the sequence of processing executed within the Web collection module 111 on a collected document file will be described with reference to the flowchart of FIG. 5.

【００４５】First, the attribute information to be registered in the knowledge DB 131 (URL, AUTHOR, ORGDATE, DATE) is acquired from the collected document file (step S101). The attribute information can be obtained from, for example, values returned by the information site over HTTP or values attached within the collected document file. Then, in the second processing mode, the file type is determined from, for example, the extension of the collected document file, to decide whether it is an HTML file or plain text file, or a file of some other format (steps S102, S103). If the collected document file is an HTML file or a plain text file (YES in step S103), the text extraction processing described above (for an HTML file, text extraction and link URL detection) is executed (step S104), and the attribute information and text data are described in the registration file in the format described above, with the text data inserted in BODY (step S105). If, on the other hand, the file is of a format other than HTML or plain text (NO in step S103), the file is output as is to the registration directory 113 (step S106), and then the attribute information and the path name of the file are described in the registration file, with the path name described in BDYFILE (step S107).

【００４６】If the document file at the target URL could not be acquired from the Internet/intranet 30, mode information (MODE) = 0 (delete) is described in the registration file, on condition that the content of that document file is already registered in the knowledge DB 131.

【００４７】In the first processing mode, steps S102 and S103 are not performed, and steps S104 and S105 are executed on all acquired files.
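The branching of FIG. 5, together with the deletion rule for documents that could no longer be acquired, can be summarized in a small decision function. This is a sketch only: the function name, the returned dictionary, and the set of extensions treated as HTML or plain text are assumptions made for illustration.

```python
import os
from urllib.parse import urlsplit

# Assumed extensions for HTML and plain text files; not specified in the text.
TEXT_EXTENSIONS = {".html", ".htm", ".txt"}


def plan_record(url, fetched, already_registered, processing_mode=2):
    """Decide what to write to the registration file for one URL.

    Returns None when nothing is written for the URL.
    """
    if not fetched:
        # Deletion is requested only if the document is already in the knowledge DB.
        return {"MODE": 0, "UNIQUE": url} if already_registered else None
    ext = os.path.splitext(urlsplit(url).path)[1].lower()
    if processing_mode == 1 or ext in TEXT_EXTENSIONS:
        # Steps S104 and S105: extract text and place it in BODY.
        return {"MODE": 2, "UNIQUE": url, "text_in": "BODY"}
    # Steps S106 and S107: output the file as is and record its path in BDYFILE.
    return {"MODE": 2, "UNIQUE": url, "text_in": "BDYFILE"}
```

A URL without an extension falls into the BDYFILE branch in this sketch; a fuller version would also consult other hints, such as the content type returned over HTTP.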

【００４８】Next, the procedure of the registration processing executed by the registration module 12 will be described with reference to the flowchart of FIG. 6.

【００４９】Upon receiving a registration request from the Web collection module 111, the registration module 12 acquires from the registration directory 113 the registration file (XML file) specified in the registration request file from the Web collection module 111, and performs the following processing record by record while taking records out of the registration file one at a time. First, it checks whether the mode information in the record being processed is MODE=0 or MODE=2 (steps S111, S112).

【００５０】If MODE=2, the registration module 12 registers, according to each tag in the record, the data items described in those tags (the contents of TYPE, UNIQUE, ORGDATE, TITLE, AUTHOR, DATE, URL, and BODY) in the registration-destination knowledge database in the knowledge DB 131 specified in the registration request file (step S113). If no text data exists in the BODY tag, no text data is registered. Next, it is determined whether a path name is described in the BDYFILE tag (step S114). If a path name is described (YES in step S114), the corresponding content file is acquired from the storage area specified by that path name (step S115), and text data is extracted from the content file (step S116). The extracted content is then registered in the registration-destination knowledge database as the text data of the corresponding document file (step S117).

【００５１】If MODE=0, the knowledge DB 131 is searched for the attribute information and text data of the registered document file specified by UNIQUE in the record, and that registered content is deleted from the registration-destination knowledge database (step S118).
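The per-record processing of FIG. 6 can be sketched as below, with a plain dictionary keyed by UNIQUE standing in for the registration-destination knowledge database. The sketch assumes that each tag is a child element of RECORD and that PATH1 is a child element of BDYFILE; the figures are not reproduced here, so that layout, the function name, and the `extract_text` callback are all hypothetical. Handling of DEL is omitted.

```python
import xml.etree.ElementTree as ET


def register(xml_text, knowledge_db, extract_text):
    """Apply every record of a registration file to knowledge_db (a dict)."""
    root = ET.fromstring(xml_text)
    for record in root.iter("RECORD"):
        unique = record.findtext("UNIQUE")
        mode = record.findtext("MODE")
        if mode == "0":
            # Step S118: delete the registered content identified by UNIQUE.
            knowledge_db.pop(unique, None)
            continue
        if mode != "2":
            continue
        # Step S113: register the data items; an empty BODY is not registered.
        entry = {child.tag: child.text for child in record
                 if child.tag not in ("MODE", "BDYFILE") and child.text}
        path = record.findtext("BDYFILE/PATH1")
        if path:
            # Steps S114 to S117: extract text from the content file and register it.
            entry["BODY"] = extract_text(path)
        knowledge_db[unique] = entry   # overwrites an existing entry, or adds a new one
    return knowledge_db
```

Keying the store by UNIQUE means that a MODE=2 record replaces an existing entry with the same key and creates a new entry otherwise, so the same code path covers both updated and newly added documents.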

【００５２】The same processing is executed for registration requests from the collection modules other than the Web collection module 111. Because this system has a single registration module 12, registration processing is executed one request at a time, in order.

【００５３】FIG. 7 shows the relationship between the state of a document file (Web content) on the Internet/intranet 30 and the registration/deletion processing to be performed on the knowledge DB 131. In this embodiment, the following processing is performed under a policy of reflecting the latest state of the Web content on the Internet/intranet 30 in the knowledge DB 131 as far as possible.

【００５４】(1) Updated Web content is overwritten in the knowledge DB 131 (MODE=2). When the attribute information and text data of updated Web content are registered, their UNIQUE is identical to the UNIQUE of the pre-update Web content already registered in the knowledge DB 131. With MODE=2, therefore, the attribute information and text data of the pre-update Web content already registered in the knowledge DB 131 are updated (overwritten) with those of the updated Web content.

【００５５】(2) Added Web content is additionally registered in the knowledge DB 131 (MODE=2). When the attribute information and text data of added Web content are registered, their UNIQUE is not yet registered in the knowledge DB 131. With MODE=2, therefore, the attribute information and text data of the added Web content can be additionally registered in the knowledge DB 131.

【００５６】(3) Deleted Web content is also deleted from the knowledge DB 131 (MODE=0).

【００５７】Next, the functions provided by the management interface 112 will be described concretely. As described above, the management interface 112 is a user interface for setting, managing, and executing Web collection tasks, and provides the administrator user with: 1) a function for creating a plurality of Web collection task settings and saving them in the setting file 204; 2) a function for management work such as listing and deleting collection settings; 3) a function for starting and ending (suspending and resuming) collection; and 4) a status list display function that presents the status and results of collection in real time. The status list display function displays, on the screen of the administrator user's Web browser 21, a list of the collection/registration status of every collection task, showing a status such as "collecting", "waiting for registration", or "registering" for each collection task so that the administrator user can easily grasp its situation.

【００５８】Here, "collecting" is a status indicating that the corresponding collection task has started and collection processing is in progress. For "collecting", the number of items collected so far is also displayed. "Waiting for registration" is a status indicating that the collection processing of the corresponding collection task is complete and the task is waiting for registration processing by the registration module 12. "Registering" is a status indicating that registration processing by the registration module 12 has started and is in progress. For "registering", the number of items registered so far is also displayed. The status indicating the collection/registration situation is updated in the order "collecting" → "waiting for registration" → "registering".

【００５９】Next, the mechanism for realizing the status list display for the individual collection tasks will be described with reference to FIG. 8.

【００６０】As described above, in this system not only the Web collection module 111 but also the file collection module, the RDB collection module, the community collection module, and the user collection module are in operation, and the registration module 12 is shared by these collection modules. Because the processing that each collection module executes for status management is the same, the status management functions are described below with the Web collection module 111 as the example.

【００６１】The Web collection module 111 sequentially executes the plurality of Web collection tasks specified by the respective items of setting information held in the setting file 203. In doing so, for each Web collection task to be processed, the Web collection module 111 creates a status information file 311 unique to that task when the task starts. The status information file 311 is a file for managing the progress of the Web collection task from the start of collection through registration, and is managed under, for example, a file name containing the setting name of the Web collection task and the date and time (year, month, day, hour, minute, second) at which collection started. When the file is created, status information indicating "collecting" is written to it. Collection processing by the Web collection module 111 then proceeds while the collected-count value in the status information file 311 is updated. When the collection processing is complete, the Web collection module 111 writes status information indicating "waiting for registration" to the status information file 311, updating the status from "collecting" to "waiting for registration", and then outputs to the registration module 12 the registration file (XML file) describing the document information collected by the task and a registration request file containing the registration request. The registration request file includes, among other things, the file name of the status information file 311 corresponding to the Web collection task.

【００６２】As the registration processing it has started progresses, the registration module 12 writes status information indicating the progress of that registration processing to the status information file 311 corresponding to the Web collection task being registered. At the start of the registration processing, it updates the status from "waiting for registration" to "registering". Registration processing by the registration module 12 then proceeds while the registered-count value in the status information file 311 is updated. When the registration processing is complete, the corresponding status information file 311 is deleted by the registration module 12.
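The life cycle of one status information file, from creation at the start of collection to deletion at the end of registration, might look like the sketch below. The specification does not state the file format, so the JSON layout, the field names, the `.json` suffix, and all function names are assumptions made for illustration.

```python
import json
from datetime import datetime
from pathlib import Path

# Allowed status transitions, in the order given in the text.
NEXT_STATUS = {
    "collecting": "waiting for registration",
    "waiting for registration": "registering",
}


def _read(path):
    return json.loads(path.read_text(encoding="utf-8"))


def _write(path, data):
    path.write_text(json.dumps(data), encoding="utf-8")


def create_status_file(directory, setting_name, started=None):
    """Create the file at task start; the name holds the setting name and start time."""
    started = started or datetime.now()
    path = Path(directory) / f"{setting_name}_{started:%Y%m%d%H%M%S}.json"
    _write(path, {"status": "collecting", "collected": 0, "registered": 0})
    return path


def update(path, **changes):
    """Update counters such as collected=... or registered=..."""
    data = _read(path)
    data.update(changes)
    _write(path, data)
    return data


def advance(path):
    """Move the status one step forward and return the new status."""
    data = _read(path)
    data["status"] = NEXT_STATUS[data["status"]]
    _write(path, data)
    return data["status"]


def finish(path):
    """Registration is complete: the status information file is deleted."""
    path.unlink()
```

In this arrangement the collection module would call `create_status_file`, `update`, and the first `advance`, while the registration module would call the second `advance`, `update`, and `finish`, using the file name passed in the registration request file.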

【００６３】By writing, for each Web collection task, the progress of its collection processing and the progress of its registration processing to the same status information file 311 in this way, the collection/registration status of each Web collection task whose collection has started can be managed individually.

【００６４】The status list display program 312 is a program provided to display the status list display screen 313 on the administrator user's terminal, and is implemented as part of the management interface 112 described above. Based on the contents of the status information files 311 that the collection modules create each time they start a collection, the status list display program 312 displays the status list display screen 313, which shows the current progress of every collection from the start of collection through registration.

【００６５】FIG. 9 shows an example of the status list display screen 313. As illustrated, for each collection task started by the plurality of collection modules, the status list display screen shows "knowledge database name", "collection type", "setting name (collection task name)", "collection target", "status", and "target count (registered count / collected count)". In FIG. 9, three Web collection tasks with the setting names (collection task names) info1, info2, and info3 are being executed; the info1 task is "registering", the info2 task is "waiting for registration", and the info3 task is "collecting". In addition, a file collection task by the file collection module, a user collection task by the user collection module, a community collection task by the community collection module, and two RDB collection tasks by the RDB collection module have also been started. The file collection task and the user collection task are each "waiting for registration", the community collection task is "collecting", and of the two RDB collection tasks one is "waiting for registration" and the other is "collecting". Because there is only one registration module 12, there is always only one task that is "registering". Likewise, for each collection module, there is basically always only one task that is "collecting".

【００６６】The status list display screen 313 is further provided with an "update status" button 401 and a "delete status" button 402. When the "update status" button 401 is pressed, the status of each task is updated to its latest status. The "delete status" button 402 is used to remove the status display of a task selected on the status list display screen 313 from the status list display screen 313.

【００６７】FIG. 10 shows how the status information is updated by the Web collection module 111 and the registration module 12.

【００６８】(1) At the start of a Web collection task, the Web collection module 111 creates the status information file 311 and writes "collecting" status information to it.
(2) During collection processing, the Web collection module 111 updates the collected-count information in the status information file 311 every time it collects a new document file.
(3) When the collection processing is complete, the Web collection module 111 writes "waiting for registration" status information to the status information file 311, updating the current status from "collecting" to "waiting for registration".
(4) The Web collection module 111 then issues a registration request file to the registration module 12, requesting the registration module 12 to execute registration processing. After this, the Web collection module 111 can start the next Web collection task.

[0069] (5) At the start of the registration process, the registration module 12 writes the status information "registering" into the status information file 311, updating the current status from "waiting for registration" to "registering". (6) During the registration process, the registration module 12 updates the registration count information in the status information file 311 each time new document information is registered. (7) When the registration process is completed, the registration module 12 deletes the status information file 311. After this, the registration module 12 starts the registration process for the next collection task waiting for registration.

The status list display program 312 checks the contents of the status information files 311 periodically (for example, every 30 seconds), and the current status of each task is displayed on the status list display screen 313. When the "status update" button 401 described above is pressed, the status list display program 312 examines the contents of the status information files 311 at that moment and refreshes the status list display screen 313 to the latest state.
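The polling behavior described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the JSON file layout, the `.status` file suffix, and the function names `write_status` and `list_statuses` are assumptions made purely for illustration. Only the status labels and the idea of one status information file per task come from the description.

```python
import json
import tempfile
from pathlib import Path

# Status labels follow the description; everything else here is hypothetical.
COLLECTING = "collecting"
WAITING = "waiting for registration"
REGISTERING = "registering"


def write_status(path: Path, status: str, collected: int = 0, registered: int = 0) -> None:
    """Overwrite one task's status information file (assumed JSON layout)."""
    path.write_text(json.dumps(
        {"status": status, "collected": collected, "registered": registered}))


def list_statuses(status_dir: Path) -> dict:
    """Read every status file in the directory, as one poll of a status list would."""
    return {p.stem: json.loads(p.read_text())
            for p in sorted(status_dir.glob("*.status"))}
```

A display loop under these assumptions would call `list_statuses` on a timer (the description mentions 30 seconds as an example) and again whenever the update button is pressed, then redraw the list from the returned dictionary.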

[0070] Next, the series of processing procedures executed by the Web collection module 111 for each Web collection task will be described with reference to the flowchart of FIG. 11.

[0071] When a Web collection task is executed, a status information file 311 corresponding to the Web collection task to be processed is first created, and status information indicating "collecting" is written into it (step S201). Then, based on the setting information for that Web collection task held in the setting file 203, the Web collection process of sequentially acquiring a group of document files from the Internet/intranet 30 is started (step S202). In the Web collection process, the document file specified by the starting URL is acquired first, and any link destination URLs it contains are added to the result file 204. Each time a document file is acquired, the collection count information in the status information file 311 is incremented by 1 (step S203). Whether the Web collection process is complete is then determined by whether any uncollected URL remains registered in the result file 204 (step S204). Until no uncollected URLs remain in the result file 204, the process of acquiring the document file specified by such a URL (step S202) and the process of updating the collection count information (step S203) are repeated.

[0072] When the Web collection process is completed (YES in step S204), status information indicating "waiting for registration" is written into the status information file 311, updating the current status from "collecting" to "waiting for registration" (step S205), and a registration request file is then issued (step S206).
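As a rough illustration of steps S201 to S205, the following Python sketch walks a link graph and keeps a status file up to date. It is a sketch under stated assumptions, not the patented implementation: the `fetch` callable, the in-memory list standing in for the result file 204, and the JSON status layout are all hypothetical, and the issuing of the registration request file (step S206) is left out.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def run_collection(start_url: str,
                   fetch: Callable[[str], Tuple[str, List[str]]],
                   status_path: Path) -> Dict[str, str]:
    """Collect documents reachable from start_url, updating the status file as it goes."""
    status = {"status": "collecting", "collected": 0}
    status_path.write_text(json.dumps(status))       # S201: create the status file

    result_file: List[str] = [start_url]              # stands in for result file 204
    documents: Dict[str, str] = {}
    index = 0
    while index < len(result_file):                   # S204: any uncollected URL left?
        url = result_file[index]
        index += 1
        text, links = fetch(url)                      # S202: acquire the document file
        documents[url] = text
        for link in links:                            # add newly found link URLs
            if link not in result_file:
                result_file.append(link)
        status["collected"] += 1                      # S203: collection count +1
        status_path.write_text(json.dumps(status))

    status["status"] = "waiting for registration"     # S205: collection finished
    status_path.write_text(json.dumps(status))
    return documents
```

Passing `fetch` in as a parameter keeps the sketch free of network access; a real collector would replace it with an HTTP client and link extractor of its choosing.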

[0073] Next, the procedure of the series of registration processes executed by the registration module 12 for each Web collection task will be described with reference to the flowchart of FIG. 12.

[0074] At the start of the registration process, status information indicating "registering" is first written into the status information file 311 corresponding to the Web collection task to be registered, which is specified by the registration file, and the current status is updated from "waiting for registration" to "registering" (step S211). Next, the registration process is performed, in which records are taken from the registration file one at a time and the document information (attribute information and text) is registered in the destination knowledge database (step S212). Each time document information is registered, the registration count information in the status information file 311 is incremented by 1 (step S213). The processes of steps S212 and S213 are repeated until all records have been registered. When all records have been registered (NO in step S214), the status information file 311 is deleted (step S215). Because of this file deletion, a Web collection task whose registration process has completed is automatically excluded from the status list display. The status list display program 312 periodically checks all status information files 311, but because the relevant status information file 311 is deleted automatically when registration completes, only the status information files 311 corresponding to tasks still in operation need to be checked.
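The registration flow of steps S211 to S215 can be sketched in the same spirit. This is an illustrative Python sketch only: the record and knowledge database types (plain dictionaries and a list), the JSON status layout, and the name `run_registration` are assumptions and do not appear in the patent.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def run_registration(records: List[Dict[str, str]],
                     knowledge_db: List[Dict[str, str]],
                     status_path: Path) -> None:
    """Register each record, track the count, then delete the status file."""
    status = json.loads(status_path.read_text())
    status.update({"status": "registering", "registered": 0})
    status_path.write_text(json.dumps(status))    # S211: mark as registering

    for record in records:                         # S212: register one record at a time
        knowledge_db.append(record)
        status["registered"] += 1                  # S213: registration count +1
        status_path.write_text(json.dumps(status))

    status_path.unlink()                           # S215: task drops out of the status list
```

Because completion is signalled by removing the file rather than by writing a final state, any reader that simply enumerates the status files sees only tasks that are still in progress, with no extra filtering.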

[0075] As described above, according to this embodiment, the collection/registration status of every collection process in operation can be displayed in a list, so an administrator can easily see which of the configured collection processes is in which state without resorting to log analysis or similar work.

[0076] Since all functions of the knowledge information collecting system of this embodiment are implemented by a computer program, the same effects as those of this embodiment can be obtained easily simply by storing the computer program on a computer-readable storage medium and, through that storage medium, installing and running the program on an ordinary computer that can be connected to a computer network.

[0077] The present invention is not limited to the above embodiment and can be modified in various ways at the implementation stage without departing from its gist. Furthermore, the above embodiment includes inventions at various stages, and various inventions can be extracted by appropriately combining the plurality of disclosed constituent elements. For example, even if some constituent elements are removed from all the constituent elements shown in the embodiment, the configuration from which those constituent elements have been removed can be extracted as an invention, provided that the problem described in the section on the problem to be solved by the invention can be solved and the effects described in the section on the effects of the invention can be obtained.

[0078]

[Effect of the Invention] As described above, according to the present invention, the collection/registration processing status of each of a plurality of document collection processes can be grasped easily.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a knowledge information collecting system according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the functional configuration of the Web collection module provided in the knowledge information collecting system of the embodiment.

FIG. 3 is a diagram showing an example of the output format of a registration file used in the knowledge information collecting system of the embodiment.

FIG. 4 is a diagram showing another example of the output format of a registration file used in the knowledge information collecting system of the embodiment.

FIG. 5 is a flowchart showing the processing procedure of the Web collection module provided in the knowledge information collecting system of the embodiment.

FIG. 6 is a flowchart showing the processing procedure of the registration module provided in the knowledge information collecting system of the embodiment.

FIG. 7 is a diagram showing the relationship between Web content and its registration processing in the knowledge information collecting system of the embodiment.

FIG. 8 is a diagram for explaining the principle of the status list display provided by the knowledge information collecting system of the embodiment.

FIG. 9 is a diagram showing an example of the status list display screen used in the knowledge information collecting system of the embodiment.

FIG. 10 is a diagram for explaining the status information update processing performed by the Web collection module and the registration module provided in the knowledge information collecting system of the embodiment.

FIG. 11 is a flowchart for explaining the series of processing procedures that the Web collection module provided in the knowledge information collecting system of the embodiment executes for each Web collection task.

FIG. 12 is a flowchart for explaining the procedure of the registration process executed by the registration module provided in the knowledge information collecting system of the embodiment.

[Explanation of symbols]

11 ... Web collection system
12 ... Registration module
13 ... Knowledge engine
30 ... Internet/intranet
111 ... Web collection module
112 ... Management interface
113 ... Registration directory
131 ... Knowledge database
132 ... Search engine
201 ... Lock file
202 ... Control file
203 ... Setting file
204 ... Result file
205 ... Log file
301 ... Collection control unit
302 ... Attribute extraction unit
303 ... Text extraction unit
304 ... Format conversion unit
311 ... Status information file
312 ... Status list display program
313 ... Status list display screen

Continuation of front page: (72) Inventor Koji Shioda, c/o Toshiba Corporation Ome Factory, 2-9 Suehiro-cho, Ome-shi, Tokyo. F-term (reference): 5B075 ND03 NK44 QP01

Claims

1. A knowledge information collecting system for collecting, from a network, document information to be registered in a knowledge database, the system comprising: document collecting means for executing, based on each of a plurality of pieces of setting information each indicating a collection condition for a document file group from the network, a plurality of document collection processes corresponding to the respective pieces of setting information, the document collecting means creating, at the start of each document collection process to be processed, a status information file for managing the progress status of that document collection process from the start of collection to registration, and writing status information indicating the progress status of the document collection process into the status information file corresponding to that document collection process in accordance with its progress; registration means for accepting a registration request issued from the document collecting means each time a document collection process is completed, and for sequentially executing, in the order in which the registration requests are accepted, a registration process for registering in the knowledge database the result of the document collection process specified by the registration request, the registration means writing status information indicating the progress status of the registration process into the status information file corresponding to the document collection process targeted by the registration process, in accordance with the progress of the started registration process; and status display means for displaying a status list screen showing the current progress status of each document collection process, based on the status information held in the status information file corresponding to each document collection process that has been started.

2. The knowledge information collecting system according to claim 1, wherein the registration means includes means for deleting, when the started registration process is completed, the status information file corresponding to the document collection process that was the target of the registration process.

3. The knowledge information collecting system according to claim 1, wherein the document collecting means includes means for writing, at the start of a document collection process, status information indicating that collection is in progress into the status information file corresponding to that document collection process, and for updating, when the started document collection process is completed, the status information in the status information file corresponding to the completed document collection process to status information indicating that the process is waiting for registration by the registration means, and the registration means includes means for updating, at the start of the registration process, the status information in the status information file corresponding to the document collection process targeted by the registration process to status information indicating that the registration process is being executed, and for deleting, when the started registration process is completed, the status information file corresponding to the document collection process that was the target of the registration process.

4. The knowledge information collecting system according to claim 3, wherein the status information file includes document collection count information indicating the number of document files collected in the document collection process and document registration count information indicating the number of pieces of document information, each corresponding to a document file, registered in the registration process, the document collecting means includes means for updating the document collection count information in the status information file each time a document file is collected from the network by the started document collection process, and the registration means includes means for updating the document registration count information in the status information file each time document information is registered in the knowledge database by the started registration process.

5. A knowledge information collecting system for collecting document information to be registered in a knowledge database, the system comprising: a plurality of document collecting means provided respectively for a plurality of mutually different information sources, each executing a document collection process for collecting a document file group from the corresponding information source, each of the plurality of document collecting means creating, at the start of its document collection process, a status information file for managing the progress status of the document collection process from the start of collection to registration, and writing status information indicating the progress status of the document collection process into the status information file corresponding to that document collection process in accordance with its progress; registration means for accepting a registration request issued each time one of the plurality of document collecting means completes a document collection process, and for sequentially executing, in the order in which the registration requests are accepted, a registration process for registering in the knowledge database the result of the document collection process specified by the registration request, the registration means writing status information indicating the progress status of the registration process into the status information file corresponding to the document collection process targeted by the registration process, in accordance with the progress of the started registration process; and status display means for displaying a status list screen showing the current progress status, from the start of collection to registration, of the document collection process of each of the plurality of document collecting means, based on the status information held in the status information file corresponding to each document collection process that has been started.

6. The knowledge information collecting system according to claim 5, wherein the plurality of document collecting means include at least first document collecting means for collecting document information published on an information site on a network, and second document collecting means for collecting document information from a file server of an electronic filing system, a database server, or an electronic bulletin board type community.

7. A knowledge information collecting method for collecting, from a network, document information to be registered in a knowledge database, the method comprising: a document collection step of executing, based on each of a plurality of pieces of setting information each indicating a collection condition for a document file group from the network, a plurality of document collection processes corresponding to the respective pieces of setting information, the document collection step creating, at the start of each document collection process to be processed, a status information file for managing the progress status of that document collection process from the start of collection to registration, and writing status information indicating the progress status of the document collection process into the status information file corresponding to that document collection process in accordance with its progress; a registration step of accepting a registration request issued from the document collection step each time a document collection process is completed, and sequentially executing, in the order in which the registration requests are accepted, a registration process for registering in the knowledge database the result of the document collection process specified by the registration request, the registration step writing status information indicating the progress status of the registration process into the status information file corresponding to the document collection process targeted by the registration process, in accordance with the progress of the started registration process; and a status display step of displaying a status list screen showing the current progress status of each document collection process, based on the status information held in the status information file corresponding to each document collection process that has been started.

8. A knowledge information collecting method for collecting document information to be registered in a knowledge database, the method comprising: a document collection step of executing document collection processes for collecting document file groups from a plurality of mutually different types of information sources, the document collection step creating, for the document collection process of each information source and at the start of that document collection process, a status information file for managing the progress status of the document collection process from the start of collection to registration, and writing status information indicating the progress status of the document collection process into the status information file corresponding to that document collection process in accordance with its progress; a registration step of accepting a registration request issued from the document collection step each time the document collection process for one of the information sources is completed, and sequentially executing, in the order in which the registration requests are accepted, a registration process for registering in the knowledge database the result of the document collection process specified by the registration request, the registration step writing status information indicating the progress status of the registration process into the status information file corresponding to the document collection process targeted by the registration process; and a status display step of displaying a status list screen showing the current progress status of each document collection process performed on the plurality of types of information sources, based on the status information held in the status information file corresponding to the document collection process for each of the plurality of types of information sources.