JPH10260979A

JPH10260979A - Information collecting method and device

Info

Publication number: JPH10260979A
Application number: JP9065106A
Authority: JP
Inventors: Hiroshi Matsuo; 比呂志松尾
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1997-03-18
Filing date: 1997-03-18
Publication date: 1998-09-29

Abstract

PROBLEM TO BE SOLVED: To collect information efficiently by comparing the description text that corresponds to each link destination described in a hyperlink document with a narrowing condition, and holding the link destinations that match the narrowing condition as scheduled access nodes. SOLUTION: A start condition and an end condition of the search are set (step S1), and a narrowing condition for link destinations is set (step S2). The access destination designated by the start condition and the scheduled access nodes are accessed to acquire hyperlink documents, and each acquired hyperlink document is stored as a search result (step S3). The description text that corresponds to each link destination described in the hyperlink document is compared with the narrowing condition, and each link destination that matches the narrowing condition is held as a scheduled access node (step S4). This processing is performed for every hyperlink document to be accessed.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an information collecting method and apparatus, and more particularly to an information collecting method and apparatus that automatically collect documents by following the link destinations of hyperlink documents. More specifically, it relates to an information collecting method and apparatus that support information collection when hyperlink documents are distributed and stored on servers distributed over a network.

[0002]

2. Description of the Related Art

A conventional information collecting apparatus will be described. FIG. 8 shows the configuration of a conventional information collecting apparatus. The conventional information collecting apparatus comprises a search condition setting unit C1, a search unit C2, and a search result display unit C3. The search unit C2 comprises a hyperlink document access unit C21, a scheduled access node extraction unit C22, and an end condition determination unit C23.

[0003] FIG. 9 shows the operation of the conventional information collecting apparatus.
Step 10) First, the search condition setting unit C1 sets an initial access node and a maximum node depth as the search conditions.
Step 11) As the initial condition, the initial access node is set as the current node.

[0004] Step 12) The hyperlink document access unit C21 accesses the current node, acquires the hyperlink document, and stores it as a search result.
Step 13) The end condition determination unit C23 checks whether the current node is at the maximum depth. If it is, the process proceeds to step 15; if it is not, the process proceeds to step 14.

[0005] Step 14) The scheduled access node extraction unit C22 stores the link destinations described in the hyperlink document as scheduled access nodes.
Step 15) The end condition determination unit C23 checks whether any unprocessed scheduled access node remains and, if so, sets it as the current node. If an unprocessed scheduled access node exists, the process returns to step 12.

[0006] Step 16) The search result display unit C3 displays the search results. Through the above processing, all hyperlink documents linked from the initial access node down to the maximum node depth can be acquired.
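The conventional procedure in steps 10 to 16 amounts to an exhaustive traversal bounded by depth. A minimal Python sketch follows. It assumes a hypothetical in-memory `LINKS` dictionary in place of real network access, and it adds a `seen` set to avoid fetching a document twice, which the text does not mention. The names are illustrative and not taken from the patent.

```python
from collections import deque

# Hypothetical link graph standing in for documents on remote servers.
LINKS = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": ["d"],
    "d": [],
}

def collect_all(start, max_depth, links=LINKS):
    """Fetch every document reachable from start within max_depth links."""
    pending = deque([(start, 0)])   # scheduled access nodes (step 11)
    seen = {start}                  # assumption: skip already scheduled nodes
    results = []                    # stored search results
    while pending:                  # step 15: any unprocessed node left?
        node, depth = pending.popleft()
        results.append(node)        # step 12: access and store the document
        if depth >= max_depth:      # step 13: at maximum depth, skip step 14
            continue
        for target in links.get(node, []):   # step 14: keep every link
            if target not in seen:
                seen.add(target)
                pending.append((target, depth + 1))
    return results
```

Every link is followed regardless of its content, which is the behaviour the invention sets out to restrict.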

[0007]

PROBLEMS TO BE SOLVED BY THE INVENTION

However, because the conventional apparatus described above accesses every linked hyperlink document, it acquires a large number of documents when the specified maximum node depth is large. If each document has n link destinations and the maximum node depth is specified as m, the m-th level contains n to the m-th power (n^m) documents, so increasing the search depth means accessing an enormous number of documents. For example, with n = 5 and m = 5, 3125 documents are accessed at the fifth level.
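The growth described above is easy to check numerically. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that every document has exactly n links and that no two links lead to the same document.

```python
def docs_at_level(n, m):
    """Number of documents at level m when each document has n links."""
    return n ** m

def docs_up_to_level(n, m):
    """Total documents from the start document (level 0) through level m."""
    return sum(n ** k for k in range(m + 1))

# The example in the text: n = 5 links per document, depth m = 5.
fifth_level = docs_at_level(5, 5)
total = docs_up_to_level(5, 5)
```

Under these assumptions the fifth level alone holds 3125 documents, and the whole traversal touches 3906.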

[0008] In many cases, however, the information the user wants is only a small part of these documents. Considerable effort is therefore required to narrow down the collected documents by running a search over them, or to find the necessary documents by hand. In addition, accessing a large number of documents distributed over a network places a heavy load on the network.

[0009]

SUMMARY OF THE INVENTION

The present invention has been made in view of the above points. Its object is to provide an information collecting method and apparatus that collect information efficiently by accessing only the link destinations that satisfy a narrowing condition, rather than all link destinations, so that access is limited to documents the user is likely to want.

[0010]

MEANS FOR SOLVING THE PROBLEMS

FIG. 1 is a diagram for explaining the principle of the present invention. The present invention is an information collection method for supporting information collection when hyperlink documents are distributed and stored on servers distributed over a network. In this method, an initial condition and an end condition of the search are set (step 1); a narrowing condition for link destinations is set (step 2); the access destination specified by the initial condition and the scheduled access nodes are accessed, hyperlink documents are acquired, and each acquired hyperlink document is stored as a search result (step 3); and the description text corresponding to each link destination described in the hyperlink document is compared with the narrowing condition, and link destinations that match the narrowing condition are held as scheduled access nodes (step 4). This processing is performed for all hyperlink documents to be accessed.

[0011] Further, in the present invention, a keyword specified as the narrowing condition for link destinations is expanded on the basis of a thesaurus dictionary that represents the relationships between word concepts, and the resulting words are set as collation target words. A link destination whose corresponding description text in the hyperlink document contains a collation target word is set as a scheduled access node.

[0012] FIG. 2 is a diagram showing the principle configuration of the present invention. The present invention is an information collecting apparatus that automatically collects documents by following the link destinations of hyperlink documents, and comprises: search condition setting means 1 for setting an initial condition and an end condition of the search; narrowing condition setting means 2 for setting a narrowing condition for link destinations; access candidate narrowing means 3 for judging the description text corresponding to each link destination described in a hyperlink document against the narrowing condition and setting link destinations that satisfy the narrowing condition as scheduled access nodes; hyperlink document access means 4 for accessing the access destination specified by the initial condition and the scheduled access nodes, acquiring hyperlink documents, and storing them as search results; and search management means 5 for repeatedly activating the hyperlink document access means 4 and the access candidate narrowing means 3 on the basis of the initial condition and the end condition.

[0013] The present invention further comprises a thesaurus dictionary representing the relationships between word concepts. The access candidate narrowing means 3 includes narrowing condition setting means for expanding the keyword specified as the narrowing condition on the basis of the thesaurus dictionary and setting the result as collation target words, and narrowing means for setting, as a scheduled access node, a link destination whose corresponding description text in the hyperlink document contains a collation target word.

[0014] As described above, the information collecting method and apparatus according to the present invention compare the description text corresponding to each link destination described in an accessed hyperlink document with the narrowing condition set by the narrowing condition setting means, and narrow the scheduled access nodes down to the matching link destinations. This avoids access to unnecessary documents and makes it possible to efficiently collect documents that may contain the information the user wants.

[0015]

EMBODIMENTS OF THE INVENTION

FIG. 3 shows the configuration of the information collecting apparatus of the present invention. The information collecting apparatus in the figure comprises a search condition setting unit 1, a narrowing condition setting unit 2, an access candidate narrowing unit 3, a hyperlink document access unit 4, a search management unit 5, an access management table 6, an acquired document storage unit 7, a processing request receiving unit 8, a search result display unit 9, a collation target word storage unit 10, and a thesaurus dictionary 11. The apparatus is connected to a server group 13 via a network 12.

[0016] The access management table 6 is a table that stores the identifiers of documents scheduled to be accessed (hereinafter called scheduled access nodes) and related data, and is used to manage document access. The access management table 6 defines items such as the scheduled access node, the processing result, and the node depth. The scheduled access node item holds the identifier of a document that has been selected for access. The processing result item holds "processed" or "unprocessed", depending on whether the document has been acquired. The node depth item holds the number of links separating the document from the start document, with the start document at depth 0.
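One way to picture the access management table is as a list of small records. The Python sketch below is an assumption about a possible representation, not the patent's data layout: the class and field names are hypothetical, and the duplicate check in `register` is an addition that the text does not describe.

```python
from dataclasses import dataclass

UNPROCESSED = "unprocessed"
PROCESSED = "processed"

@dataclass
class AccessEntry:
    node: str                   # <scheduled access node>: document identifier
    result: str = UNPROCESSED   # <processing result>
    depth: int = 0              # <node depth>: links from the start document

class AccessTable:
    def __init__(self):
        self.entries = []

    def register(self, node, depth):
        """Add a node as unprocessed unless it is already listed (assumption)."""
        if all(e.node != node for e in self.entries):
            self.entries.append(AccessEntry(node, UNPROCESSED, depth))

    def next_unprocessed(self):
        """Return the first unprocessed entry, or None if all are processed."""
        return next((e for e in self.entries if e.result == UNPROCESSED), None)
```

A search loop would call `next_unprocessed()` until it returns `None`, setting `result` to `PROCESSED` after each document is fetched.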

[0017] The acquired document storage unit 7 stores the documents acquired by the hyperlink document access unit 4. The processing request receiving unit 8 is the input/output interface with the user; it either has its own information input/output function or is connected to a terminal via a network. It accepts search condition setting requests, narrowing condition setting requests, search requests, and search result output requests from the user, and relays them to the search condition setting unit 1, the narrowing condition setting unit 2, the search management unit 5, and the search result display unit 9, respectively.

[0018] The search result display unit 9 outputs the documents stored in the acquired document storage unit 7 in response to a request from the processing request receiving unit 8. The collation target word storage unit 10 stores the collation target words set by the narrowing condition setting unit 2 so that the access candidate narrowing unit 3 can refer to them. The thesaurus dictionary 11 is a dictionary that stores information on the relationships between the concepts that words represent, and is used as knowledge for narrowing.

[0019] The search condition setting unit 1 receives a search condition setting request via the processing request receiving unit 8, sets the start document as the initial condition of the search, and sets the maximum node depth as the end condition. The narrowing condition setting unit 2 receives a narrowing condition setting request via the processing request receiving unit 8 and sets the narrowing condition. The narrowing condition can be specified in various ways, for example by specifying keywords, by specifying a conditional expression, or by specifying a matching pattern.
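The text names three ways of specifying a narrowing condition without defining them further. As a rough illustration only, the Python sketch below treats each condition as a predicate over a description text. Reading "conditional expression" as a Boolean combination of conditions and "matching pattern" as a regular expression is an assumption, and all function names are hypothetical.

```python
import re

def keyword_condition(*keywords):
    """True when the text contains at least one of the keywords."""
    return lambda text: any(k in text for k in keywords)

def pattern_condition(pattern):
    """True when the regular expression matches somewhere in the text."""
    regex = re.compile(pattern)
    return lambda text: regex.search(text) is not None

def all_of(*conditions):
    """Conditional expression: every sub-condition must hold."""
    return lambda text: all(c(text) for c in conditions)

def any_of(*conditions):
    """Conditional expression: at least one sub-condition must hold."""
    return lambda text: any(c(text) for c in conditions)

def negate(condition):
    """Conditional expression: the sub-condition must not hold."""
    return lambda text: not condition(text)
```

For example, `all_of(keyword_condition("shop"), negate(keyword_condition("food")))` accepts "shopping mall" and rejects "food shop".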

[0020] On receiving a search request via the processing request receiving unit 8, the search management unit 5 performs the search by repeatedly activating the access candidate narrowing unit 3 and the hyperlink document access unit 4, referring to and updating the access management table 6, until the end condition specified in the search conditions is reached. The access candidate narrowing unit 3 extracts the description text corresponding to each link destination from the hyperlink document, extracts as scheduled access nodes those link destinations whose description text contains a collation target word stored in the collation target word storage unit 10, and registers them in the access management table 6 via the search management unit 5.

[0021] The hyperlink document access unit 4 receives information on the node to be accessed from the search management unit 5, accesses the server group 13 via the network 12, acquires the corresponding hyperlink document, and stores it in the acquired document storage unit 7. FIG. 4 is a flowchart showing the sequence of operations of the information collection method of the present invention.
Step 101) In the search condition setting unit 1, the start document is set as the initial condition of the search, and the maximum node depth is set as the end condition.

[0022] Step 102) The access candidate narrowing unit 3 registers the start document in the access management table 6 as a scheduled access node.
Step 103) The narrowing condition setting unit 2 receives the narrowing condition from the processing request receiving unit 8, and a keyword is set as the narrowing condition.
Step 104) The narrowing condition setting unit 2 obtains the position in the thesaurus dictionary 11 of the keyword specified in the narrowing condition, extracts its synonyms, broader terms, and narrower terms as collation target words, and registers them in the collation target word storage unit 10.

[0023] Step 105) The hyperlink document access unit 4 accesses, via the search management unit 5, a scheduled access node registered in the access management table 6, acquires the hyperlink document, and stores it in the acquired document storage unit 7 as a search result. It also sets, via the search management unit 5, the <processing result> of that scheduled access node in the access management table 6 to "processed".

[0024] Step 106) The search management unit 5 checks whether the node depth of the acquired document has reached the maximum node depth. If it has, the process proceeds to step 109; otherwise, the process proceeds to step 107.
Step 107) The search management unit 5 extracts the description text corresponding to each link destination from the hyperlink document.

[0025] Step 108) The access candidate narrowing unit 3 checks whether a collation target word is present in the description text and, if so, registers the link destination corresponding to that description text in the access management table 6 as a scheduled access node.
Step 109) The search management unit 5 checks whether the access management table contains an "unprocessed" scheduled access node and, if so, the process returns to step 105.

[0026] Step 110) If there is none, the search result display unit 9, on receiving a search result output request, outputs the search results stored in the acquired document storage unit 7.
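Steps 102 and 105 to 110 can be read as a single loop over the access management table. The Python sketch below is one possible reading, not the patent's implementation. It assumes a hypothetical in-memory `DOCS` dictionary in place of network access, plain substring matching against the collation target words, and a duplicate check before registration, none of which are specified at this point in the text.

```python
# Hypothetical documents: identifier -> list of (link destination, description text).
DOCS = {
    "d0": [("d1", "computer"), ("d2", "mail order")],
    "d1": [("d3", "retail software")],
    "d2": [("d4", "department store"), ("d5", "food")],
    "d4": [("d6", "mail order catalogue")],
}

def collect(start, max_depth, target_words, docs=DOCS):
    """Collect documents, following only links whose text matches a target word."""
    table = [{"node": start, "result": "unprocessed", "depth": 0}]   # step 102
    collected = []                              # acquired document store
    while True:
        entry = next((e for e in table if e["result"] == "unprocessed"), None)
        if entry is None:                       # step 109: nothing left to access
            return collected                    # step 110: output the results
        links = docs.get(entry["node"], [])     # step 105: fetch the document
        collected.append(entry["node"])
        entry["result"] = "processed"
        if entry["depth"] >= max_depth:         # step 106: skip steps 107 and 108
            continue
        for dest, text in links:                # step 107: link and description
            matched = any(w in text for w in target_words)    # step 108
            known = any(e["node"] == dest for e in table)     # assumption
            if matched and not known:
                table.append({"node": dest, "result": "unprocessed",
                              "depth": entry["depth"] + 1})
```

With these sample documents, `collect("d0", 2, ["mail order", "department store"])` visits only `d0`, `d2`, and `d4`; the branch under "computer" is never fetched.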

[0027]

EXAMPLES

Examples of the present invention will now be described with reference to the drawings. FIG. 5 shows an example of hyperlink documents in one example of the present invention. In FIG. 5, [h0001], [h0002], and so on are identifiers assigned to the hyperlink documents. The documents shown here (hereinafter, hyperlink documents are simply called documents) are simplified for the purpose of explanation; they may also contain tables, images, and the like. A hyperlink document defines tags that specify the display format and tags that indicate link destinations, and the description text corresponding to each link destination can be extracted from this information. Natural language analysis may additionally be applied so that description text that cannot be extracted from tag information alone is also extracted.
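The text does not name a markup language. Assuming HTML, where the link tag is `<a href="...">` and the description text is the anchor text, tag-based extraction could be sketched with Python's standard `html.parser` module as follows. The class and function names are hypothetical, and nested or malformed anchors are not handled.

```python
from html.parser import HTMLParser

class LinkTextParser(HTMLParser):
    """Collect (link destination, description text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None     # destination of the anchor being read, if any
        self._text = []       # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None

def extract_links(html):
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Text outside an anchor is ignored here, so description text that sits next to a link rather than inside it would need the additional analysis the paragraph mentions.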

[0028] In document [h0001], the description text "computer" corresponds to the link to document [h0011], and the description text "mail order" corresponds to the link to document [h0012]. Likewise, "company", "art", and "media" are the description texts of the links to [h0013], [h0014], and [h0015], respectively. Writing each link destination and its corresponding text in the form (<link destination>, <description text>), document [h0001] contains the link information (h0011, computer), (h0012, mail order), (h0013, company), (h0014, art), and (h0015, media).

[0029] FIG. 6 shows an example of the thesaurus dictionary in one example of the present invention. The thesaurus dictionary 11 shown in the figure stores information on the relationships between the concepts that words represent, and is used as knowledge for narrowing. In the figure, a double-headed arrow denotes a synonym relationship and a single-headed arrow denotes a broader/narrower relationship. For example, "sales" (販売), "shopping" written in katakana (ショッピング), and "shopping" written in Latin letters are synonyms, and the narrower terms of these words are "mail order" and "retail". Seen from "retail", the broader term is "sales" and the narrower terms are "department store" and "specialty store".

[0030] Here, as one example of specifying a keyword as the narrowing condition, a configuration is described in which the keyword is expanded into collation target words by referring to the thesaurus dictionary 11, and link destinations are narrowed according to whether a collation target word is contained in the description text. In this example, to make reference by the access candidate narrowing unit 3 easy, the narrowing condition setting unit 2 stores the synonyms, broader terms, and narrower terms of the specified keyword in the collation target word storage unit 10 as collation target words.
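The expansion of a keyword into collation target words can be sketched in Python as below. The data encodes only the relations described for FIG. 6, with English glosses standing in for the Japanese terms, so the katakana and Latin spellings of "shopping" collapse into one entry. The function names are hypothetical. Following narrower terms transitively while taking only direct broader terms is an assumption about how the expansion behaves.

```python
# Relations described for FIG. 6, in English glosses (assumption).
SYNONYM_GROUPS = [{"sales", "shopping"}]
NARROWER = {
    "sales": ["mail order", "retail"],
    "retail": ["department store", "specialty store"],
}

def synonyms_of(word):
    """Return the synonym group containing word, or an empty set."""
    for group in SYNONYM_GROUPS:
        if word in group:
            return set(group)
    return set()

def broader_of(word):
    """Return the direct broader terms of word."""
    return {parent for parent, children in NARROWER.items() if word in children}

def narrower_of(words):
    """Return every narrower term reachable from any of the given words."""
    found, stack = set(), list(words)
    while stack:
        for child in NARROWER.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def expand(keyword):
    """Expand a keyword into its synonyms, broader terms, and narrower terms."""
    group = synonyms_of(keyword)
    base = group or {keyword}
    broader = set()
    for word in base:
        broader |= broader_of(word)
    return group | broader | narrower_of(base)
```

Note that a keyword without a synonym group, such as "retail", contributes only its related terms here; whether the keyword itself should also be matched is not stated in the text.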

[0031] FIG. 7 shows an example of the access management table in one example of the present invention. In the figure, (a) shows the state of the table before any document is accessed, (b) shows the state of the table after document [h0001] is acquired, and (c) shows the state of the table after document [h0012] is acquired. The case in which document [h0001] is set as the start document and "2" is set as the maximum node depth is described below, following the flowchart of FIG. 4.

[0032] Step 101) A search condition setting request is received from the processing request receiving unit 8; the start document is set as the initial condition of the search, and the maximum node depth is set as the end condition. That is, document [h0001] is set as the start document and "2" is set as the maximum node depth.
Step 102) The start document specified as the initial condition of the search is registered in the access management table 6 as a scheduled access node. More than one start document may be specified. Here, the case in which document [h0001] of FIG. 5 is specified as the start document is described. For the start document [h0001], as shown in FIG. 7(a), "h0001" is entered in the <scheduled access node> item, "unprocessed" in the <processing result> item, and "0" in the <node depth> item.

[0033] Step 103) The narrowing condition setting unit 2 receives a narrowing condition setting request from the processing request receiving unit 8, and keywords are specified as the narrowing condition. Here, "shopping" (ショッピング) and "furniture" are assumed to be specified as the keywords.
Step 104) The narrowing condition setting unit 2 refers to the thesaurus dictionary 11, extracts the synonyms, broader terms, and narrower terms of the keywords specified in the narrowing condition as collation target words, and stores them in the collation target word storage unit 10. When the thesaurus dictionary of FIG. 6 is used, the collation target words are determined as follows.

[0034] For "shopping" (ショッピング), the synonyms "sales" (販売), "shopping" (ショッピング), and "shopping" (Latin letters) are extracted, "mail order" and "retail" are extracted as narrower terms, and "department store" and "specialty store" are extracted as narrower terms of "retail". For "furniture", "product" is extracted as the broader term, and "desk" and "chair" are extracted as narrower terms.

[0035] That is, the collation target words are "sales", "shopping" (ショッピング), "shopping" (Latin letters), "mail order", "retail", "department store", "specialty store", "product", "desk", and "chair", and these words are stored in the collation target word storage unit 10.
Step 105) Among the scheduled access nodes registered in the access management table 6, a node whose <processing result> is "unprocessed" is accessed via the network 12, and the hyperlink document is acquired. The acquired hyperlink document is stored in the acquired document storage unit 7 as a search result, and the <processing result> of that scheduled access node is set to "processed". For simplicity, link destinations are represented here only by hyperlink document identifiers, but a link destination also includes information on where the hyperlink document is located, that is, on which server it resides, so the server to be accessed can be determined.

[0036] Step 106) It is checked whether the node depth of the acquired document has reached the maximum node depth. This can be determined by inspecting the <node depth> item of the access management table 6. If the maximum node depth has been reached, steps 107 and 108 are skipped and the process proceeds to step 109.

[0037] Step 107) The description text corresponding to each link destination is extracted from the acquired hyperlink document. For document [h0001], (h0011, computer), (h0012, mail order), (h0013, company), (h0014, art), and (h0015, media) are extracted.

[0038] Step 108) It is checked whether a collation target word is present in the description text and, if so, the link destination corresponding to that description text is registered in the access management table 6 as a scheduled access node. For document [h0001], only "mail order" is present, in (h0012, mail order), so document [h0012] is registered in the access management table 6. At this time, "unprocessed" is entered in <processing result>, and <node depth> is set to the parent node's depth plus 1. Because the <node depth> of document [h0001] is 0, the <node depth> of document [h0012] is 1. As a result, the contents of the access management table 6 are as shown in FIG. 7(b).

[0039] Step 109) It is checked whether the access management table 6 contains an "unprocessed" scheduled access node. If it does, the process returns to step 105; if it does not, the process proceeds to step 110.
Step 110) On receiving a search result output request from the processing request receiving unit 8, the search result display unit 9 outputs the search results stored in the acquired document storage unit 7 to the processing request receiving unit 8.
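Step 108 for document [h0001] can be reproduced with a few lines of Python. This is an illustrative sketch: the table is a plain list of dictionaries, the collation target words are the English glosses of the words listed in paragraph [0035], and matching is a simple substring test, all of which are assumptions about representation rather than details given in the text.

```python
TARGET_WORDS = ["sales", "shopping", "mail order", "retail", "department store",
                "specialty store", "product", "desk", "chair"]

def register_matches(table, parent, links, target_words=TARGET_WORDS):
    """Step 108: register destinations whose description text has a target word."""
    for dest, text in links:
        if any(word in text for word in target_words):
            table.append({"node": dest, "result": "unprocessed",
                          "depth": parent["depth"] + 1})   # parent depth plus 1

# State after step 105 for the start document [h0001].
table = [{"node": "h0001", "result": "processed", "depth": 0}]

# Link information extracted from [h0001] in step 107.
h0001_links = [("h0011", "computer"), ("h0012", "mail order"),
               ("h0013", "company"), ("h0014", "art"), ("h0015", "media")]

register_matches(table, table[0], h0001_links)
```

After the call, the table holds [h0001] as processed at depth 0 and [h0012] as unprocessed at depth 1, matching the state described for FIG. 7(b).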

[0040] With the processing described above, for the hyperlink documents shown in FIG. 5, the processing from step 105 to step 110 proceeds as follows.
(1) Document [h0001], specified as the start document in the initial condition, is acquired (step 105).
(2) The <node depth> of document [h0001] is 0, which has not reached the maximum node depth of 2, so the process proceeds to step 107 (step 106).

[0041] (3) From document [h0001], the explanation texts (h0011, computer), (h0012, mail order), (h0013, company), (h0014, art), and (h0015, media) are extracted (step 107).

[0042] (4) Since the collation target word "mail order" is present in (h0012, mail order), [h0012] is registered in the access management table 6 as a scheduled access node.
(5) Since [h0012] is unprocessed, the process returns to step 105 (step 109).
(6) Document [h0012] is acquired (step 105).

[0043] (7) The <node depth> of document [h0012] is 1, which has not reached the maximum node depth of 2, so the process proceeds to step 107 (step 106).
(8) From document [h0012], the explanation texts (h0121, department store), (h0122, shopping mall), (h0123, food), (h0124, computer), and (h0125, book) are extracted (step 107).

[0044] (9) The collation target word "department store" is present in (h0121, department store), and the collation target word "shopping" is contained in the character string "shopping mall" of (h0122, shopping mall), so [h0121] and [h0122] are registered in the access management table 6 as scheduled access nodes. Since [h0121] and [h0122] are unprocessed, the process returns to step 105 (step 109).

[0045] (10) Since [h0121] and [h0122] are unprocessed, the process returns to step 105 (step 109).
(11) Document [h0121] is acquired (step 105).
(12) The <node depth> of document [h0121] is 2, which has reached the maximum node depth of 2, so steps 107 and 108 are skipped and the process proceeds to step 109 (step 106).

[0046] (13) Since [h0122] remains "unprocessed", the process returns to step 105 (step 109).
(14) Document [h0122] is acquired (step 105).
(15) The <node depth> of document [h0122] is 2, which has reached the maximum node depth of 2, so steps 107 and 108 are skipped and the process proceeds to step 109 (step 106).

[0047] (16) Since no scheduled access node remains unprocessed, the process proceeds to step 110 (step 109).
(17) When a search result output request is received, the documents [h0001], [h0012], [h0121], and [h0122] stored in the acquired document storage unit 7 are output.
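The walkthrough in (1) to (17) can be reproduced with a short Python sketch of the loop over steps 105 to 109. This is only an illustration under stated assumptions: the function `collect`, the dictionary layout of the table, and the in-memory `LINKS` structure (which stands in for fetching documents over the network) are hypothetical, and the collation target words are the three that appear in the walkthrough.

```python
# Link structure of FIG. 5, as far as the walkthrough reveals it:
# document -> list of (link destination, explanation text).
LINKS = {
    "h0001": [("h0011", "computer"), ("h0012", "mail order"),
              ("h0013", "company"), ("h0014", "art"), ("h0015", "media")],
    "h0012": [("h0121", "department store"), ("h0122", "shopping mall"),
              ("h0123", "food"), ("h0124", "computer"), ("h0125", "book")],
}
TARGET_WORDS = ["mail order", "department store", "shopping"]
MAX_DEPTH = 2


def collect(start, links, target_words, max_depth):
    """Follow only link destinations whose explanation text contains a
    collation target word, down to max_depth."""
    # Access management table: node -> processing result and node depth.
    table = {start: {"result": "unprocessed", "depth": 0}}
    collected = []  # stands in for the acquired document storage unit
    while True:
        # Step 109: look for an unprocessed scheduled access node.
        pending = [n for n, row in table.items() if row["result"] == "unprocessed"]
        if not pending:
            break  # proceed to step 110
        node = pending[0]
        collected.append(node)  # step 105: acquire and store the document
        table[node]["result"] = "processed"
        depth = table[node]["depth"]
        if depth >= max_depth:  # step 106: maximum node depth reached
            continue
        for dest, text in links.get(node, []):  # step 107: extract explanation texts
            # Step 108: substring match against the collation target words.
            if dest not in table and any(w in text for w in target_words):
                table[dest] = {"result": "unprocessed", "depth": depth + 1}
    return collected


result = collect("h0001", LINKS, TARGET_WORDS, MAX_DEPTH)
```

With these inputs, `result` lists [h0001], [h0012], [h0121], and [h0122] in that order. A substring test is used because the walkthrough matches "shopping" inside "shopping mall"; a real implementation might need language-aware matching.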

[0048] In this way, only the narrowed-down link destinations are accessed. As in the above example, if a user who wants to buy furniture specifies "shopping" and "furniture", access to documents other than those that may contain the desired information is suppressed. Therefore, even when the maximum node depth is set to a large value, information suited to the purpose can be collected efficiently in a reasonable time.

[0049] The present invention is not limited to the above embodiment and can be variously modified and applied within the scope of the claims.

[0050]

[Effects of the Invention] As described above, according to the information collection method and apparatus of the present invention, link destinations are narrowed down by the narrowing-down condition, so the number of documents to be accessed can be greatly reduced and documents that are highly likely to contain the desired information can be acquired easily. Therefore, information can be collected efficiently by using the apparatus according to the present invention.

[0051] A quantitative comparison with the conventional apparatus is as follows. The conventional apparatus accesses every link destination and acquires its document. Therefore, if each document has n link destinations and access is performed down to the m-th level, the number of documents at the m-th level is n to the m-th power (n^m), and increasing the search depth means accessing an enormous number of documents.

[0052] In contrast, the information collecting apparatus of the present invention narrows down link destinations according to the link destination narrowing condition, so the number of documents to be accessed can be reduced. Specifically, if the link destinations of each document are narrowed down to 1/k, the number of documents accessed at the m-th level is reduced to 1/(k to the m-th power), that is, 1/k^m. For example, if link destinations are narrowed down to 1/3, the number of documents at the fifth level is reduced to 1/243 of the number accessed without narrowing.
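The arithmetic of this comparison can be checked with a few lines of Python. The function names are hypothetical, and the value n = 9 is chosen only for illustration; the description itself gives no concrete n.

```python
from fractions import Fraction


def docs_at_level(n, m, k=1):
    """Documents at level m when each document has n link destinations
    and only 1/k of them are followed (k=1 means no narrowing)."""
    return Fraction(n, k) ** m


def reduction_factor(k, m):
    """Ratio of narrowed to un-narrowed document counts at level m."""
    return Fraction(1, k ** m)


conventional = docs_at_level(9, 5)     # 9^5 documents without narrowing
narrowed = docs_at_level(9, 5, k=3)    # (9/3)^5 documents with 1/3 narrowing
ratio = narrowed / conventional        # equals reduction_factor(3, 5)
```

Exact fractions are used so that the ratio comes out as exactly 1/243 rather than a rounded floating-point value.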

[Brief description of the drawings]

FIG. 1 is a diagram for explaining the principle of the present invention.

FIG. 2 is a diagram showing the principle configuration of the present invention.

FIG. 3 is a configuration diagram of the information collecting apparatus of the present invention.

FIG. 4 is a flowchart of the operation of the information collecting apparatus of the present invention.

FIG. 5 is an example of hyperlink documents according to an embodiment of the present invention.

FIG. 6 is an example of a thesaurus dictionary according to an embodiment of the present invention.

FIG. 7 is an example of the access management table of the present invention.

FIG. 8 is a configuration diagram of a conventional information collecting apparatus.

FIG. 9 is a flowchart of the processing of the conventional apparatus.

[Description of Signs]
1 search condition setting means, search condition setting unit
2 narrowing condition setting means, narrowing condition setting unit
3 access candidate narrowing means, access candidate narrowing unit
4 hyperlink document access means, hyperlink document access unit
5 search management means, search management unit
6 access management table
7 acquired document storage unit
8 processing request receiving unit
9 search result display unit
10 collation target word storage unit
11 thesaurus dictionary
12 network
13 server group

Claims

[Claims]

1. An information collection method for supporting information collection in a case where hyperlink documents are distributed and stored on servers distributed on a network, the method comprising: setting an initial condition and an end condition of a search; setting a condition for narrowing down link destinations; accessing the access destination specified in the initial condition and each scheduled access node, acquiring a hyperlink document, and holding the hyperlink document as a search result; and comparing the explanation text corresponding to each link destination described in the hyperlink document with the narrowing-down condition and holding a link destination that matches the narrowing-down condition as a scheduled access node, wherein the above processing is performed for all hyperlink documents to be accessed.

2. The information collection method according to claim 1, wherein a keyword specified as the condition for narrowing down link destinations is expanded based on a thesaurus dictionary representing relationships between word concepts and is set as collation target words, and a link destination whose corresponding explanation text, as described in the hyperlink document, contains a collation target word is set as a scheduled access node.

3. An information collecting apparatus for automatically collecting documents by following link destinations of hyperlink documents, comprising: search condition setting means for setting an initial condition and an end condition of a search; access candidate narrowing means for setting a condition for narrowing down link destinations, evaluating the explanation text corresponding to each link destination described in a hyperlink document against the narrowing condition, and setting a link destination that satisfies the condition as a scheduled access node; hyperlink document access means for accessing the access destination designated by the initial condition and each scheduled access node to acquire a hyperlink document, and storing the hyperlink document as a search result; and search management means for repeatedly starting the hyperlink document access means and the access candidate narrowing means based on the initial condition and the end condition.

4. The information collecting apparatus according to claim 3, further comprising a thesaurus dictionary representing relationships between word concepts, wherein the access candidate narrowing means includes: a narrowing-down condition setting unit that expands a keyword specified as the narrowing condition based on the thesaurus dictionary and sets the result as collation target words; and a narrowing-down unit that sets, as a scheduled access node, a link destination whose corresponding explanation text, as described in the hyperlink document, contains a collation target word.