JP2001060198A

JP2001060198A - Information collecting method and recording medium recording information collection program

Info

Publication number: JP2001060198A
Application number: JP11234284A
Authority: JP
Inventors: Kazunori Fujimoto; 和則藤本; Mitsunobu Shimazu; 光伸島津
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1999-08-20
Filing date: 1999-08-20
Publication date: 2001-03-06

Abstract

PROBLEM TO BE SOLVED: To provide an information collecting method for appropriately acquiring information pages closely related to a keyword from a small number of links on a network. SOLUTION: This information page collecting method has a step for acquiring a link candidate list of related information pages from a set of search pairs, each pairing the target word with one of one or more predetermined search words; a step for acquiring, for every link in the link candidate list, the information page pointed to by the link and adding the link of that information page to a Hub link list when the number of links appearing on the information page is larger than a predetermined number; and a step for following links from the links on the Hub link list for a predetermined number of levels, acquiring all the information pages pointed to by the followed links, and defining them as an information page set.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a method for collecting necessary information from the Internet, and more particularly to an information collection method, and a recording medium on which an information collection program is recorded, used for the automatic construction of a knowledge base that handles hypertext in which a plurality of texts are associated with one another by links (hereinafter referred to as Web pages).

[0002]

2. Description of the Related Art

In recent years, with the spread of the Internet, many texts have come to be provided electronically. In view of this, research has been conducted on automatically constructing, from texts on the Internet, the knowledge bases used for intelligent systems and machine translation. In the automatic construction of such a knowledge base, an important technical issue is how to efficiently collect texts useful for constructing the knowledge base from the enormous body of text on the Internet.

[0003] As an example, consider an intelligent system that judges the performance of digital cameras. Constructing the knowledge base of such a system requires, for each model, knowledge such as "How many pixels does the CCD have?" and "What are its features?" Such knowledge is often described on Web pages provided by digital camera manufacturers (for example, product announcement articles often provide the main features and a specification table of the product). Therefore, if such Web pages can be collected, the knowledge base can be constructed automatically using techniques for acquiring knowledge from text.

[0004] To collect texts from the Internet, searches based on checking for keyword occurrence have conventionally been used. In this method, the object handled by the intelligent system (a digital camera in the above example) is first taken as the keyword, and links to texts in which the keyword appears are retrieved. The necessary texts are then extracted by examining, one by one, the texts pointed to by the links thus obtained.

[0005]

PROBLEMS TO BE SOLVED BY THE INVENTION

However, an enormous number of texts on the Internet contain words such as "digital camera", and a search with a given keyword often returns an enormous number of links. In such cases, finding the texts needed for the knowledge base requires examining these links one by one, which takes considerable labor and time. To solve this problem, a method using a search that focuses on the frequency of keyword occurrence has also been proposed.

[0006] This method focuses on the frequency of occurrence of the keyword and preferentially presents links to texts in which the keyword occurs frequently. However, the texts needed to construct a knowledge base are not necessarily those in which the keyword appears many times. For example, in a manufacturer's product announcement article for a digital camera, repeating the keyword "digital camera" would be redundant, so it does not appear very often. Consequently, even with such a frequency-based method, the links pointing to the texts needed to construct the knowledge base cannot be narrowed down, and the difficulty of finding the necessary texts among many links cannot be avoided.

[0007] As described above, because the prior art presents links to texts in which the keyword appears, many links are retrieved, and finding appropriate texts among them requires enormous labor and time.

[0008] An object of the present invention is to provide an information collection method, and a recording medium on which an information collection program is recorded, which, in collecting Web pages closely related to a keyword, use Hub pages, in which selected links closely related to the keyword are gathered, to acquire appropriate Web pages from a small number of carefully selected links.

[0009]

MEANS FOR SOLVING THE PROBLEMS

The information page collection method of the present invention is an information collection method that, given a target word that is a single keyword, acquires a set of "hypertexts closely related to the target word" (hereinafter referred to as information pages) from "a set of hypertexts in which a plurality of texts are associated by links" (hereinafter referred to as a network). The method is characterized by comprising: a link candidate acquisition step which, in order to acquire a link candidate list pointing to Hub pages that gather links to information pages describing the object indicated by the target word, forms, for each of one or more predetermined search words, a search pair consisting of that search word and the target word, thereby constructing a set of search pairs, and acquires a link candidate list listing the links of information pages on the network that are related to at least one of the search pairs; a Hub page determination step which, in order to extract from the link candidate list a Hub link list consisting only of links to Hub pages, acquires, for every link in the link candidate list, the information page pointed to by the link, and adds the link of that information page to the Hub link list when the number of links appearing on the information page is larger than a predetermined number; and a page acquisition step which, in order to acquire a set of information pages from the Hub link list, follows links from the links in the Hub link list for a predetermined number of levels, acquires all the information pages pointed to by the followed links as an information page set, and takes that information page set as the set of information pages closely related to the target word.

[0010] Further, the recording medium on which the information collection program of the present invention is recorded is a recording medium recording an information collection program that, given a target word that is a single keyword, acquires a set of information pages from the network. The program comprises: a link candidate acquisition program which, in order to acquire a link candidate list pointing to Hub pages that gather links to information pages describing the object indicated by the target word, forms, for each of one or more predetermined search words, a search pair consisting of that search word and the target word, thereby constructing a set of search pairs, and acquires a link candidate list listing the links of information pages on the network that are related to at least one of the search pairs; a Hub page determination program which, in order to extract from the link candidate list a Hub link list consisting only of links to Hub pages, acquires, for every link in the link candidate list, the information page pointed to by the link, and adds the link of that information page to the Hub link list when the number of links appearing on the information page is larger than a predetermined number; and a page acquisition program which, in order to acquire a set of information pages from the Hub link list, follows links from the links in the Hub link list for a predetermined number of levels, acquires all the information pages pointed to by the followed links as an information page set, and takes that information page set as the set of information pages closely related to the target word.

[0011] In summary, the information page collection method of the present invention provides three steps: a link candidate acquisition step, a Hub page determination step, and a page acquisition step. The link candidate acquisition step forms, for each of one or more predetermined search words, a search pair consisting of that search word and the target word, thereby constructing a set of search pairs. It then acquires a list of links to the Web pages on the Internet that are related to at least one of the search pairs.

[0012] The Hub page determination step acquires, for every link in the link candidate list, the Web page pointed to by the link, and adds the link of that Web page to the Hub link list when the number of links appearing on the Web page is larger than a predetermined number.

[0013] The page acquisition step follows links from the links in the Hub link list for a predetermined number of levels, acquires all the Web pages pointed to by the followed links as an information page set, and takes that information page set as the set of Web pages closely related to the target word.

[0014] Thus, the link candidate acquisition step of the present invention performs a search that combines the target word with the search words, which are a collection of words associated with Hub pages. This makes it possible to acquire candidate links pointing to Hub pages.

[0015] Next, the Hub page determination step can determine whether a collected Web page is a Hub page by focusing on a characteristic of Hub pages, namely that they contain a large number of links. This makes it possible to extract from the link candidate list a list consisting only of links to Hub pages.

[0016] The links appearing on a Hub page are a collection of links to Web pages closely related to the target word. The page acquisition step collects the Web pages pointed to by the links appearing on such Hub pages. This makes it possible to collect Web pages closely related to the target word.

[0017] As described above, according to the present invention, Web pages closely related to the target word can be collected using the three steps of the link candidate acquisition step, the Hub page determination step, and the page acquisition step.

[0018]

EMBODIMENTS OF THE INVENTION

Next, embodiments of the present invention will be described in detail with reference to the drawings. FIG. 1 is a diagram for explaining an embodiment of the information collection method of the present invention, and shows the step configuration of the Web page collection method according to this embodiment on the Internet.

[0019] As shown in FIG. 1, the information page collection method of the present invention consists of three steps: a link candidate (URL candidate) acquisition step 11, a Hub page determination step 12, and a page acquisition step 13.

[0020] The link candidate acquisition step 11 acquires Web pages that are candidates for Hub pages, that is, pages gathering links to Web pages that describe the object indicated by the target word (the keyword). This step is configured, for example, as a step that performs two processes: a search pair construction process and a Web page search process.

[0021] Here, the search pair construction process forms, for each search word, a search pair consisting of that search word and the target word, and thereby constructs the set of search pairs. The Web page search process searches, for each search pair received from the search pair construction process, for Web pages closely related to that search pair. Such a Web page search process can be realized, for example, as follows.

[0022] First, for each search pair, the texts containing both of the two words in the search pair are extracted from a database that collects Web pages (hereinafter, the Web database). The Web pages extracted for the individual search pairs are then combined to form the Hub candidate pages. In this way, the Web page search process can collect Web pages that are candidates for Hub pages.
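The two processes of the link candidate acquisition step can be sketched roughly as follows. This is only an illustrative sketch, not the patent's implementation: the function names `build_search_pairs` and `collect_hub_candidates`, the in-memory `web_database` dictionary (URL mapped to page text), and the example.com URLs are all hypothetical, and Python's built-in substring test stands in for the keyword matching.

```python
def build_search_pairs(target_word, search_words):
    """Pair the target word with each search word (one pair per search word)."""
    return [(target_word, word) for word in search_words]


def collect_hub_candidates(web_database, pairs):
    """Return URLs of pages whose text contains both words of at least one pair.

    web_database is a hypothetical stand-in for the Web database:
    a dict mapping URL -> page text. Results keep first-seen order.
    """
    candidates = []
    for first, second in pairs:
        for url, text in web_database.items():
            if first in text and second in text and url not in candidates:
                candidates.append(url)
    return candidates


# Hypothetical miniature Web database, for illustration only.
web_database = {
    "http://example.com/a": "digital camera ranking for this year",
    "http://example.com/b": "a digital camera product announcement",
    "http://example.com/c": "link collection of digital camera makers",
}

pairs = build_search_pairs("digital camera", ["ranking", "list", "link collection"])
candidates = collect_hub_candidates(web_database, pairs)
print(pairs)
print(candidates)
```

Merging the per-pair results into one list without duplicates is an assumption here; the text only says that the pages extracted for each pair are combined.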

[0023] Within the Web page search process, the extraction of Web pages for a search pair is realized, for example, by two processes: a primary search process and a secondary search process. The primary search process takes the first word of the search pair as keyword 1 and extracts from the Web database only the texts that contain keyword 1. For this extraction, the primary search process treats a Web page as an input character string. It takes characters from the input string, starting with the first, and compares them with the first character of keyword 1. If the comparison succeeds, the character following the matched position in the input string is compared with the second character of keyword 1. This comparison is carried out for all characters of keyword 1, and if every character of keyword 1 is matched successfully, the Web page is judged to contain keyword 1. In this way, the primary search process extracts from the Web database only the texts that contain keyword 1. The secondary search process takes the second word of the search pair as keyword 2 and extracts, from the Web pages extracted by the primary search process, the texts that contain keyword 2. This extraction can be realized in the same way as the primary search process. The link candidate acquisition step is configured, for example, as a step that performs the search pair construction process and the Web page search process described above.
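The character-by-character comparison and the two-stage (primary, then secondary) filtering described above might look like the following sketch. The names `contains_keyword`, `two_stage_search`, and the `pages` dictionary are hypothetical. Restarting the comparison from the next position in the input string after a mismatch is an assumption; the text does not spell out what happens when a comparison fails.

```python
def contains_keyword(text, keyword):
    """Naive matching: compare keyword characters one by one at each start position."""
    n, m = len(text), len(keyword)
    if m == 0:
        return False
    for start in range(n - m + 1):
        matched = 0
        # Advance while the next input character equals the next keyword character.
        while matched < m and text[start + matched] == keyword[matched]:
            matched += 1
        if matched == m:  # every character of the keyword was matched
            return True
    return False


def two_stage_search(web_database, pair):
    """Primary search keeps pages with keyword 1; secondary search filters by keyword 2."""
    keyword1, keyword2 = pair
    primary = {url: text for url, text in web_database.items()
               if contains_keyword(text, keyword1)}
    return [url for url, text in primary.items()
            if contains_keyword(text, keyword2)]


# Hypothetical page texts, for illustration only.
pages = {
    "page1": "digital camera ranking",
    "page2": "digital camera press release",
}
print(two_stage_search(pages, ("digital camera", "ranking")))
```

The secondary search only scans pages that survived the primary search, so keyword 2 is never checked against pages lacking keyword 1.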

[0024] The link candidate acquisition step 11 described above can also be realized using, for example, a search engine provided on the Internet. Such a search engine can retrieve links to Web pages that contain an input target word. As described above, this kind of search is basically realized by focusing on the occurrence of words. The step can be realized by using such a search engine to obtain the links of Web pages in which both words of a search pair appear, and then acquiring the Web pages pointed to by the links thus obtained. The link candidate acquisition step 11 can thus be configured, for example, as a step that uses an Internet search engine. In this way, the link candidate acquisition step 11 can be configured to collect Hub page candidates.

[0025] The Hub page determination step 12 extracts Hub pages from the Hub page candidates collected by the link candidate acquisition step 11. This step is configured, for example, as a step that performs three processes: a link extraction process, an external link determination process, and an external link count evaluation process.

[0026] Here, the link extraction process extracts, for each Hub page candidate, the list of links appearing on that candidate page. This process can be realized by focusing on the href tag of HTML. In HTML, the language in which Web pages are written, the href tag is used as the tag indicating a link destination. Therefore, link names can be obtained by extracting the portions enclosed by href tags. In this way, the link extraction process can extract a list of links from each Hub page candidate.
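A minimal sketch of the link extraction process, using the `html.parser` module from the Python standard library. Note that in standard HTML, href is an attribute of the `a` (anchor) element rather than a tag of its own, so the sketch collects href values from `a` start tags. The class name `LinkExtractor`, the function `extract_links`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the list of link targets found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page fragment, for illustration only.
sample = (
    '<p><A HREF="http://example.com/cameras">Cameras</A> and '
    '<a href="spec/index.html">specifications</a></p>'
)
print(extract_links(sample))
```

Using a parser rather than ad hoc string slicing is a design choice of this sketch; it tolerates differences in letter case and attribute order in the markup.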

[0027] The external link determination process determines, for each link extracted by the link extraction process, whether it is an external link or an internal link. Here, an external link is a link that points to a site other than that of the Web page on which the link is provided. An internal link is a link that points to the same site as the Web page on which the link is provided. This determination can be made according to whether the link begins with http:// or simply begins with a directory name. In this way, the external link determination process can determine, for each link, whether it is an external link or an internal link.
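The prefix rule described above can be written down in a few lines. This sketch follows the rule literally (a link beginning with http:// is external, anything else is internal); the names `is_external_link` and `split_links` and the sample links are hypothetical.

```python
def is_external_link(link):
    """Literal form of the rule: links starting with http:// count as external."""
    return link.startswith("http://")


def split_links(links):
    """Separate a list of links into (external, internal), preserving order."""
    external = [link for link in links if is_external_link(link)]
    internal = [link for link in links if not is_external_link(link)]
    return external, internal


# Hypothetical links, for illustration only.
external, internal = split_links(
    ["http://example.org/review", "products/dc100.html", "http://example.net/"]
)
print(external)
print(internal)
```

Under this literal rule, an absolute link that points back to the page's own site would also be counted as external; comparing host names would be one way to refine it, but that goes beyond what the text describes.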

[0028] The external link count evaluation process determines, for each Hub page candidate, whether the candidate page is a Hub site. First, it counts, for each Hub page candidate, the number of external links appearing on the candidate page. This counting can easily be realized using the list of links extracted by the link extraction process and the determination result for each link produced by the external link determination process. Next, it judges candidate pages whose number of external links is larger than a predetermined value to be Hub pages. Conversely, it judges candidate pages whose number of external links does not exceed the predetermined value not to be Hub pages. In this way, the external link count evaluation process can determine, for each Hub page candidate, whether the candidate page is a Hub site.
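The counting and threshold test can be sketched as follows. The names `count_external_links` and `select_hub_pages`, the `candidate_links` dictionary (candidate URL mapped to the links found on that page), and the threshold values are hypothetical; the text only says that the threshold is predetermined.

```python
def count_external_links(links):
    """Count links treated as external (those beginning with http://)."""
    return sum(1 for link in links if link.startswith("http://"))


def select_hub_pages(candidate_links, threshold):
    """Keep candidates whose external link count is strictly larger than threshold."""
    return [url for url, links in candidate_links.items()
            if count_external_links(links) > threshold]


# Hypothetical candidates and their extracted links, for illustration only.
candidate_links = {
    "http://example.com/list": [
        "http://a.example/", "http://b.example/", "http://c.example/",
        "about/index.html",
    ],
    "http://example.com/news": ["http://a.example/", "products/x.html"],
}
hub_pages = select_hub_pages(candidate_links, threshold=2)
print(hub_pages)
```

The comparison is strict, matching the wording "larger than a predetermined value": a page with exactly `threshold` external links is not selected.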

[0029] As described above, the Hub page determination step 12 can be configured as a step that performs three processes: the link extraction process, the external link determination process, and the external link count evaluation process.

[0030] The page acquisition step 13 uses the links appearing on Hub pages to acquire Web pages closely related to the target word. This step is configured, for example, as a step that repeats two processes, a link extraction process and a link destination acquisition process, a predetermined number of times. Here, the link extraction process can be realized in the same way as in the Hub page determination step 12. In the page acquisition step 13, the link extraction process first extracts all links on the pages judged to be Hub pages. The link destination acquisition process then acquires the Web pages pointed to by the extracted links and treats them as secondary Hub pages. Once the secondary Hub pages have been acquired, the link extraction process extracts all links appearing on each secondary Hub page. The link destination acquisition process again acquires the Web pages pointed to by the extracted links and treats them as tertiary Hub pages. The page acquisition step 13 repeats these two processes a predetermined number of times. All Web pages obtained by the link destination acquisition process through this processing are then taken as the Hub pages closely related to the target word. As described above, the page acquisition step 13 can be configured as a step that repeats the two processes of link extraction and link destination acquisition a predetermined number of times.
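The repeated link extraction and link destination acquisition amounts to a depth-limited, level-by-level traversal starting from the Hub pages. The sketch below makes several assumptions that are not in the text: the name `acquire_pages`, a caller-supplied `fetch_links` function that returns the links found on a page (so the example runs without network access), and the skipping of pages that were already visited.

```python
def acquire_pages(hub_urls, fetch_links, levels):
    """Follow links from the Hub pages for `levels` rounds; return every page reached.

    fetch_links(url) is a hypothetical callback returning the links on that page.
    Pages already seen are not fetched again (an assumption of this sketch).
    """
    collected = []
    seen = set(hub_urls)
    frontier = list(hub_urls)
    for _ in range(levels):
        next_frontier = []
        for url in frontier:
            for link in fetch_links(url):  # link extraction
                if link not in seen:
                    seen.add(link)
                    collected.append(link)      # link destination acquisition
                    next_frontier.append(link)  # becomes the next level of pages
        frontier = next_frontier
    return collected


# Hypothetical link graph standing in for the Web, for illustration only.
graph = {
    "hub": ["p1", "p2"],
    "p1": ["p3"],
    "p2": ["p3", "hub"],
    "p3": ["p4"],
}


def fetch_links(url):
    return graph.get(url, [])


print(acquire_pages(["hub"], fetch_links, levels=1))
print(acquire_pages(["hub"], fetch_links, levels=2))
```

With `levels=1` only the pages linked directly from the Hub pages are collected; each additional level corresponds to one more repetition of the two processes.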

[0031] Through the above three steps, Web pages closely related to the target word can be acquired.

[0032] Next, an embodiment of the recording medium on which the information collection program of the present invention is recorded will be described with reference to the drawings.

[0033] FIG. 2 is a diagram showing the configuration of a computer that uses the information collection program of the present invention to acquire Web pages closely related to the target word from the Internet. The input device 21 is an input device used to access the Internet and, in order to construct search pairs from the target word and the search words, to have each search word combined with the target word into one search pair. The storage device 22 reads the various programs and tables of the information collection program from the recording medium 24, passes them to the data processing device 25 for arithmetic processing, and stores data from the data processing device 25. The output device 23 comprises output devices, such as a display device and a printer device for printing records, which, when the information collection program is executed, display Internet access, the input operations for the search pairs of target word and search words, and the progress of each processing program. The storage medium 24 is a storage medium such as a floppy (registered trademark) disk, CD-ROM, magneto-optical disk, or semiconductor memory on which the information collection program of the present invention is recorded. The data processing device 25 is a CPU that reads the information collection program from the recording medium 24 and the search pairs formed from the target word and search words entered via the input device 21, and executes the program to acquire Web pages closely related to the target word from the Internet. The communication device 26 is a communication device, such as a modem or DSU, that accesses the Internet and controls communication with the Internet.

[0034] The information collection program of the present invention recorded on the storage medium 24 consists of a link candidate acquisition program 241, a Hub page determination program 242, and a page acquisition program 243.

[0035] The link candidate acquisition program 241 is a program that forms, for each of one or more predetermined search words, a search pair consisting of that search word and the target word, thereby constructing a set of search pairs, and acquires a link candidate list listing the links of information pages on the network that are related to at least one of the search pairs.

[0036] The Hub page determination program 242 is a program that acquires, for every link in the link candidate list, the information page pointed to by the link, and adds the link of that information page to the Hub link list when the number of links appearing on the information page is larger than a predetermined number.

[0037] The page acquisition program 243 is a program that follows links from the links in the Hub link list for a predetermined number of levels, acquires all the information pages pointed to by the followed links as an information page set, and takes that information page set as the set of information pages closely related to the target word.

[0038]

EXAMPLE

In the following, the collection operation of the above steps is described in detail, using the example target word and search words shown in FIG. 3 and the example Web pages shown in FIG. 4.

[0039] In the example shown in FIG. 3, the target word is {digital camera} and the search words are {ranking, list, link collection}. First, the link candidate acquisition step forms search sets from the target word and the search words: each search word is paired with the target word to form one search set. In the example of FIG. 3, this yields three search sets: {digital camera, ranking}, {digital camera, list}, and {digital camera, link collection}. The link candidate acquisition step then searches for Web pages closely related to each of these search sets. This search operation is explained using the two Web pages shown in FIG. 4. First, the link candidate acquisition step takes the first search set, {digital camera, ranking}, and takes its first term, "digital camera", as keyword 1. It then extracts the Web pages that contain keyword 1; as a result, both of the Web pages shown in FIG. 4 are extracted. Next, the step takes the second term of the search set, "ranking", as keyword 2 and, from among the Web pages containing keyword 1, extracts those that also contain keyword 2. Web page 1 in FIG. 4 contains the word "ranking", whereas Web page 2 does not. Therefore, only Web page 1 is extracted. Hub pages often use words related to enumerating items, such as "ranking", "comparison", "list", and "link collection". Preparing such words as search words therefore makes it possible to extract Hub pages.
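The search-set construction and two-stage keyword filtering described in [0039] can be sketched as follows. This is only an illustration under stated assumptions: the function names `build_search_sets` and `link_candidates`, the in-memory `pages` dictionary standing in for a network search, and the example URLs are all hypothetical and do not appear in the patent.

```python
from typing import Dict, List, Tuple


def build_search_sets(target_word: str, search_words: List[str]) -> List[Tuple[str, str]]:
    """Pair each search word with the target word to form one search set."""
    return [(target_word, word) for word in search_words]


def link_candidates(pages: Dict[str, str], search_sets: List[Tuple[str, str]]) -> List[str]:
    """Return the URLs of pages related to at least one search set.

    `pages` maps URL -> page text; it is a hypothetical stand-in for
    whatever search facility is actually used on the network.
    """
    candidates: List[str] = []
    for keyword1, keyword2 in search_sets:
        # Stage 1: pages that contain keyword 1 (the target word).
        stage1 = [url for url, text in pages.items() if keyword1 in text]
        # Stage 2: among those, pages that also contain keyword 2.
        for url in stage1:
            if keyword2 in pages[url] and url not in candidates:
                candidates.append(url)
    return candidates


# Toy data loosely modelled on FIG. 3 and FIG. 4 (contents are invented).
pages = {
    "http://example.com/page1": "digital camera ranking: model A, model B",
    "http://example.com/page2": "a review of one digital camera",
}
search_sets = build_search_sets("digital camera", ["ranking", "list", "link collection"])
candidates = link_candidates(pages, search_sets)
```

With this toy data, both pages pass the first stage for every search set, but only page 1 passes the second stage, which mirrors the outcome described for Web page 1 and Web page 2.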

[0040] The Hub page determination step extracts Hub pages from the Hub page candidates collected by the link candidate acquisition step. Because a Hub page is a page that gathers many links, it lists many external links. Counting these external links therefore makes it possible to determine whether a page is a Hub page. The Hub page determination step first extracts the list of links in the page. This process is explained using Web page 1 shown in FIG. 4. The step first splits the page at each occurrence of the string "<a href=". As a result, Web page 1 is split into the six parts shown in the center of FIG. 5. The first part, <HTML><HEAD> ... <Table><tr><td>, is excluded. For the second and subsequent parts, the text up to the first ">" of each part is extracted. In this way, the link destinations can be extracted from the Web page. In this example, the number of links is five. If, for example, a Web page with five or more links is defined to be a Hub page, Web page 1 is extracted as a Hub page. The link-count threshold is set to a smaller value when many Hub pages are wanted, and to a larger value when fewer suffice. In this way, the Hub page determination step can determine whether each Web page is a Hub page.
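The link extraction and threshold test in [0040] can be sketched as below. The names `extract_links` and `is_hub_page` and the sample HTML are hypothetical. The sketch follows the worked example ("five or more links"), whereas the claims word the condition as "larger than a predetermined number", so the comparison operator is an assumption to adjust as needed.

```python
from typing import List


def extract_links(html: str, marker: str = "<a href=") -> List[str]:
    """Split the page at each anchor marker and keep the text up to the first '>'."""
    parts = html.split(marker)
    links: List[str] = []
    for part in parts[1:]:  # the first part precedes any anchor and is excluded
        end = part.find(">")
        if end == -1:
            continue  # malformed fragment with no closing '>'
        # Drop surrounding whitespace and quotes around the link destination.
        links.append(part[:end].strip().strip("\"'"))
    return links


def is_hub_page(html: str, threshold: int = 5) -> bool:
    """Treat a page as a Hub page when it lists at least `threshold` links."""
    return len(extract_links(html)) >= threshold


# Invented sample page with five anchors, so the split yields six parts.
sample = (
    "<HTML><HEAD></HEAD><BODY><Table><tr><td>"
    + "".join(f'<a href="http://example.com/{i}">item {i}</a>' for i in range(5))
    + "</td></tr></Table></BODY></HTML>"
)
links = extract_links(sample)
```

This plain string split is case-sensitive and would keep any extra attributes that follow the destination inside the tag; a real crawler would more likely use an HTML parser, but the split mirrors the procedure as described.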

[0041] The page acquisition step acquires Web pages using the links extracted by the Hub page determination step. It performs the same link extraction described for the Hub page determination step and follows links a predetermined number of times. The number of times links are followed is set to a large value when a large amount of text is needed, and to a small value when a small amount of text suffices. Because a large value makes collection take a long time, the number is also set to a small value when collection must finish quickly. As it follows each link, the page acquisition step acquires the page that the link points to. In this way, the page acquisition step acquires the Web pages.
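Following links for a fixed number of steps, as described in [0041], can be sketched as a breadth-first traversal. Everything named here is hypothetical: `collect_pages`, the `fetch` callback, and the toy `web` dictionary. The sketch also assumes that the collected set contains only pages reached by following links, not the Hub pages themselves; the text does not settle that point.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple


def extract_links(html: str, marker: str = "<a href=") -> List[str]:
    """Same split-based link extraction as in the Hub page determination step."""
    links = []
    for part in html.split(marker)[1:]:
        end = part.find(">")
        if end != -1:
            links.append(part[:end].strip().strip("\"'"))
    return links


def collect_pages(
    hub_links: List[str],
    fetch: Callable[[str], Optional[str]],
    steps: int,
) -> Dict[str, str]:
    """Follow links `steps` times from the Hub pages and collect the pages reached."""
    collected: Dict[str, str] = {}
    seen: Set[str] = set(hub_links)
    frontier: List[Tuple[str, Optional[str]]] = [(url, fetch(url)) for url in hub_links]
    for _ in range(steps):
        next_frontier: List[Tuple[str, Optional[str]]] = []
        for _url, html in frontier:
            if html is None:
                continue  # page could not be fetched
            for link in extract_links(html):
                if link in seen:
                    continue  # avoid fetching the same page twice
                seen.add(link)
                page = fetch(link)
                if page is not None:
                    collected[link] = page
                    next_frontier.append((link, page))
        frontier = next_frontier
    return collected


# Toy "network": a dictionary lookup stands in for an HTTP fetch.
web = {
    "hub": '<a href="a">A</a><a href="b">B</a>',
    "a": '<a href="c">C</a>',
    "b": "no links",
    "c": '<a href="d">D</a>',
    "d": "leaf",
}
one_step = collect_pages(["hub"], web.get, steps=1)
two_steps = collect_pages(["hub"], web.get, steps=2)
```

A larger `steps` value reaches more pages at the cost of more fetches, which is the trade-off between text volume and collection time noted in the paragraph above.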

[0042] As described above, the present invention makes it possible to collect Web pages.

【００４３】[0043]

[Effects of the Invention] As described above, the present invention uses the links listed on Hub pages to collect Web pages closely related to a target word. Such Hub pages gather important links that have been carefully selected by hand. Therefore, compared with the conventional method of automatically gathering important links with an Internet search engine, appropriate Web pages can be obtained from a small number of carefully selected links. The present invention thus has the effect that Web pages closely related to the target word can be collected from fewer links.

[Brief description of the drawings]

FIG. 1 is a diagram for explaining an embodiment of the information collecting method of the present invention.

FIG. 2 is a diagram showing the configuration of a computer that acquires Web pages from the Internet using the information collection program of the present invention.

FIG. 3 is a diagram showing an example of a target word and search words.

FIG. 4 is a diagram showing an example of Web pages.

FIG. 5 is a diagram showing an example of the process of obtaining an information link list.

[Explanation of symbols]

11 Link candidate (URL candidate) acquisition step
12 Hub page determination step
13 Page acquisition step
21 Input device
22 Storage device
23 Output device
24 Recording medium
25 Data processing device
26 Communication device

Continued from the front page

F-terms (reference): 5B075 KK07 ND20 NK02 NK44 PP02 PP03 PP12 PP13 PP22 PQ02 PQ42 PQ46 QS20 UU40; 5B082 EA00 EA01 GC04

Claims

[Claims]

1. An information page collecting method for obtaining, when a target word that is a keyword is given, a set of information pages, which are hypertexts closely related to the target word, from a network that is a set of hypertexts in which a plurality of texts are associated by links, the method comprising: a link candidate acquisition step of, in obtaining a link candidate list indicating Hub pages that collect links to information pages describing the object denoted by the target word, forming, for each of one or more predetermined search words, a set of search sets each pairing the search word with the target word, and obtaining, for information pages on the network, a link candidate list that lists the links of information pages related to at least one of the search sets; a Hub page determination step of, in extracting from the link candidate list a Hub link list consisting only of links to Hub pages, obtaining, for every link in the link candidate list, the information page indicated by the link, and adding the link of that information page to the Hub link list when the number of links listed on the information page is larger than a predetermined number; and a page acquisition step of, in obtaining a set of information pages from the Hub link list, following links for a predetermined number of steps from the links in the Hub link list, obtaining all the information pages indicated by the followed links to form an information page set, and taking that information page set as the set of information pages closely related to the target word.

2. A recording medium on which is recorded an information collection program for obtaining, when a target word that is a keyword is given, a set of information pages, which are hypertexts closely related to the target word, from a network that is a set of hypertexts in which a plurality of texts are associated by links, the information collection program comprising: a link candidate acquisition program that, in obtaining a link candidate list indicating Hub pages that collect links to information pages describing the object denoted by the target word, forms, for each of one or more predetermined search words, a set of search sets each pairing the search word with the target word, and obtains, for information pages on the network, a link candidate list that lists the links of information pages related to at least one of the search sets; a Hub page determination program that, in extracting from the link candidate list a Hub link list consisting only of links to Hub pages, obtains, for every link in the link candidate list, the information page indicated by the link, and adds the link of that information page to the Hub link list when the number of links listed on the information page is larger than a predetermined number; and a page acquisition program that, in obtaining a set of information pages from the Hub link list, follows links for a predetermined number of steps from the links in the Hub link list, obtains all the information pages indicated by the followed links to form an information page set, and takes that information page set as the set of information pages closely related to the target word.