JP2001290843A

JP2001290843A - Device and method for document retrieval, document retrieving program, and recording medium having the same program recorded

Info

Publication number: JP2001290843A
Application number: JP2001027163A
Authority: JP
Inventors: Hiroshi Tsuda; 宏津田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2000-02-04
Filing date: 2001-02-02
Publication date: 2001-10-19

Abstract

PROBLEM TO BE SOLVED: To automatically retrieve important document data. SOLUTION: A document retrieval device that retrieves document data from a group of document data having link relations is equipped with a link importance assigning unit 21 and an access unit 25. The link importance assigning unit 21 weights the link relations of document data 30 and assigns to the document data 30 a link importance 31, which is a degree of importance. The access unit 25 accesses the document data 30 according to the link importance 31.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention: The present invention relates to a document retrieval apparatus, a method therefor, and a recording medium for retrieving a large group of document files stored in an information processing apparatus based on the contents of the documents, the link relations among the documents, the locations where the documents are registered, and the like.

[0002]

2. Description of the Related Art: Today, with the development of computer networks, online document information has become abundant. Services that index online document information are provided so that this large volume of information can be searched and organized.

[0003] For example, in searching WWW (World Wide Web) pages on the Internet, one well-known service provides a directory service. This is a collection of links to WWW pages organized by hierarchical category, and it has the following advantages.

1. Simply by selecting (clicking) a category, the user obtains links to the WWW pages that he or she wants to browse.
2. Because WWW pages are grouped by category, a document search does not return an indiscriminately large amount of information.
3. Because WWW pages are registered in categories manually, garbage, that is, information unrelated to what the user wants to browse, rarely creeps in.

[0004] Because of these advantages, this service is very widely used on the Internet. However, because its categories are created and its WWW pages are managed manually, the service incurs high operating costs.

[0005] Automating this directory service as a whole involves the following problems to be solved.

1. Selection of important documents
2. Management of the category hierarchy, for example, adding and deleting topics
3. Assignment of documents to categories: automatic classification

Here, the selection of important documents is described. WWW pages continue to increase on both the Internet and intranets. Because similar information is created separately in many places, a search based on the presence or frequency of character strings (keywords) in documents returns too many results. This in turn makes it impossible to tell which of the many search results are important. The following approaches have been taken to the selection of important documents.

1. Reorder the search results by how well they satisfy the search request; that is, sort and rank the results based on, for example, the number of search keywords contained in each document.
2. Support access by visualizing the search results; that is, group documents in the result set that have similar content (clustering).
3. Sort the search results based on document attributes such as size and creation date and time.
4. Sort by document importance assigned by some other means; for example, sort the search results based on metadata such as link relations, analysis of user access logs, or ratings created by third-party organizations.

[0006] As an example of the last approach, assigning document importance by using the link relations of hypertext such as WWW pages has in recent years become an important technology at both the research and service levels. The simplest realization of link-based importance assignment rests on the intuition that "a document with many incoming links is highly important."

[0007]

Problems to Be Solved by the Invention: However, to make information easier to navigate, WWW pages on the same server are generally linked to one another. Personal WWW pages, for example, often contain many links to the top of the home page, such as "Return to the top of XX." Consequently, if only the raw number of incoming links is used, a server (site) or individual with a large number of pages is rated as highly important merely because of that volume. Moreover, once it is known that a search system derives importance from the number of incoming links, a page owner can deliberately raise the importance of his or her own pages by splitting pages meaninglessly or by adding pages that consist only of useless links.

[0008] In addition, the WWW page viewable at http://www.elseviernl/cas/tree/store/comnet/free/www7/02/com02.htm suggests, besides the intuition that "a document with many incoming links is highly important," the intuitions that "a document linked from a highly important document is highly important" and that "a page linked from a page with few outgoing links is highly important."

[0009] The second intuition is based on the observation that, between a WWW page introduced by a well-known directory service and a WWW page introduced in an obscure individual's link collection, the former is the more important. The third intuition is based on the idea that, for example, a document linked from a link collection with 50 outgoing links is more important than a document linked from a link collection with as many as 1000 outgoing links. An importance determination algorithm based on these intuitions first calculates a provisional importance from the number of incoming links, then updates the importance from the provisional importance according to the link relations, and repeats this operation until it converges.
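The iterative procedure described in this paragraph can be illustrated roughly as follows. This is only a sketch under stated assumptions: the source gives no formula, so the update rule below borrows a PageRank-style form, and the function name `iterate_importance`, the `damping` parameter, and the input format are all hypothetical.

```python
def iterate_importance(links, damping=0.85, tol=1e-9, max_iter=100):
    """links maps each page to the list of pages it links to (assumed input format)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Provisional importance: number of incoming links, normalized to sum to 1.
    in_count = {p: 0 for p in pages}
    for targets in links.values():
        for t in targets:
            in_count[t] += 1
    total = sum(in_count.values()) or 1
    score = {p: in_count[p] / total for p in pages}

    for _ in range(max_iter):
        # Assumed damping term: every page keeps a small base score.
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for src, targets in links.items():
            if not targets:
                continue
            # A source with few outgoing links passes more to each target,
            # and an important source passes more than an unimportant one.
            share = damping * score[src] / len(targets)
            for t in targets:
                new[t] += share
        change = sum(abs(new[p] - score[p]) for p in pages)
        score = new
        if change < tol:
            break
    return score
```

In this form the second and third intuitions appear as the two factors of `share`: the source page's current score, and the division by its number of outgoing links.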

[0010] However, even with this algorithm, a site with a large number of pages has an advantage because it accumulates many incoming links. As a result, when page importance is calculated, the top of the ranking is filled with pages from similar sites, which is the problem described above.

[0011] When a user performs a search, an interface for accessing the keywords used in the search is required. The prior art on keyword access is closely related to kana-kanji conversion interfaces.

[0012] For example, Japanese Patent Application Laid-Open No. H03-241456 realizes kana-kanji conversion on a touch-panel device. In this technique, the full reading of a keyword is entered with an on-screen software keyboard, and a "convert" key is then pressed to convert it into text containing kanji.

[0013] Further, for example, Japanese Patent Application Laid-Open Nos. H10-154144 and H10-154033 and the WWW page viewable at http://www.csl.sony.co.jp/person/masui/POBox/index.htm realize a pen-oriented text input system. In this technique, a reading is entered from an on-screen software keyboard, and as soon as part of the reading has been entered, candidates are presented continuously based on the user's usage history.

[0014] According to Japanese Patent Application Laid-Open Nos. H03-241456, H10-154144, and H10-154033 and the above WWW page, the reading or spelling of a keyword must always be entered one character at a time for kana-kanji conversion, so entering a long character string requires many input operations. There are also straightforward reading-input interfaces such as the following. In one example, a keyword list is created for each initial character, such as "あ" (a), "い" (i), and so on, and the user selects from the keyword list. In this case, however, when the list for a particular initial reading contains many keywords, it becomes difficult for the user to select a specific keyword from it. Bank automatic transfer machines are an example of this scheme.

[0015] As a further example of straightforward reading-input interfaces, there is a scheme in which the characters of the reading are entered one after another (by clicking, in the case of a pointing device), and the kana-kanji converted text is displayed once the number of keywords beginning with that reading is determined. FIG. 32 shows this scheme, using the display of "秋葉原" (Akihabara) as an example. As shown in FIG. 32, the user enters the reading of the keyword character by character from the table of 50 kana sounds. To display "秋葉原", the user first enters "あ" (a) and then enters "き" (ki), "は" (ha), "ば" (ba), and "ら" (ra) in order. When every sound of "あきはばら" (akihabara) has been entered and the number of keywords is determined, the kana-kanji converted text "秋葉原" is displayed. With this scheme as well, many sounds must be entered before a keyword with a long reading is reached.

[0016] An object of the present invention is to provide a document retrieval apparatus and method that, in selecting important documents, solve the above problem, namely that importance is biased toward particular sites having many pages, and that make it difficult for a malicious individual to manipulate importance.

[0017] A further object of the present invention is to provide a document retrieval apparatus and method that, when a search keyword is entered based on its reading, allow the keyword to be reached with fewer inputs and make keywords and documents easier to select by keeping the number of displayed keyword candidates and documents at or below a fixed number.

[0018] A further object of the present invention is to provide an apparatus and method for generating a link collection that has a directory-service-like interface and gives quick access to important documents related to keywords, for example WWW pages.

[0019]

Means for Solving the Problems: According to the present invention, a document retrieval apparatus that retrieves document data from a group of document data having link relations comprises link importance assigning means for assigning to the document data a link importance, which is an importance obtained by weighting the link relations, and access means for accessing the document data based on the link importance.

[0020] Based on the ideas that a page linked from many pages is important, and that a link from a page linking to few pages is more important than a link from a page linking to many pages, the link importance assigning means weights the link relations, calculates the link importance, and assigns it to the document data. The access means accesses the document data based on the calculated link importance. This makes it possible to retrieve important document data automatically.

[0021] Further, the link importance assigning means may include URL similarity calculating means. The URL similarity calculating means calculates the URL similarity, which is the similarity between the URLs (Uniform Resource Locators) that point to document data, and the link importance assigning means calculates the link importance using the similarity between URLs and the link relations between documents and assigns it to the document data.

[0022] For example, document data within the same site often has link relations, and the URLs of document data in the same site generally have high URL similarity. Making the weight of a link from document data with high URL similarity lower than the weight of a link from document data with low URL similarity prevents the importance of a site holding a large amount of document data from being overestimated. This in turn makes it possible to retrieve important document data more accurately. In addition, because the URL similarity is taken into account when the link importance is assigned, it becomes difficult for a third party to deliberately raise the importance of a particular document by, for example, increasing the number of links among the document data within a site. The URL similarity may be determined based on the surface string information of the URL, including the server address, path, and file name.
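One way to picture this weighting is the sketch below. The scoring scheme is an assumption made for illustration only: the text states that URL similarity may be derived from the server address, path, and file name, but gives no formula here. The names `url_similarity` and `weighted_inlink_score`, the 0.5 constants, and the example hosts are hypothetical.

```python
from urllib.parse import urlsplit

def url_similarity(src, dst):
    """Assumed scheme: 0.0 for different servers, otherwise 0.5 plus up to
    0.5 for the fraction of leading path segments the two URLs share."""
    a, b = urlsplit(src), urlsplit(dst)
    if a.hostname is None or a.hostname != b.hostname:
        return 0.0
    seg_a = [s for s in a.path.split("/") if s]
    seg_b = [s for s in b.path.split("/") if s]
    shared = 0
    for x, y in zip(seg_a, seg_b):
        if x != y:
            break
        shared += 1
    longest = max(len(seg_a), len(seg_b))
    path_part = shared / longest if longest else 1.0
    return 0.5 + 0.5 * path_part

def weighted_inlink_score(target, sources):
    """Sum of incoming links, each weighted lower the more similar
    the source URL is to the target URL."""
    return sum(1.0 - url_similarity(s, target) for s in sources)
```

With this weighting, two links from pages on the same server as the target together count no more than one link from an unrelated server, which is the effect the paragraph describes.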

[0023] The document retrieval apparatus may further include keyword extracting means for extracting keywords from the document data. The document retrieval apparatus may also further include document relevance calculating means, in which case the keyword extracting means calculates the frequency with which each extracted keyword appears in the document data, and the document relevance calculating means calculates the relevance between a keyword and document data using the link importance and the keyword's appearance frequency.

[0024] Calculating relevance from the link importance and the appearance frequency of keywords in documents, and retrieving document data with high relevance, makes it possible to retrieve important documents whose content is likely to be related to the document data the user wishes to find.

[0025] The document retrieval apparatus may further include monitoring means for monitoring accesses from users and creating an access log, and the keyword document relevance calculating means may calculate relevance using the access log in addition to the keyword appearance frequency and the link importance. Using the access log when calculating relevance makes it possible to retrieve document data that is more closely related to the keyword and more important.

[0026] Because the link importance, the keyword appearance frequency, and the access log are all used in calculating relevance, document data whose importance has been deliberately raised is less likely to be retrieved.
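As a rough illustration of combining the three signals, the following sketch multiplies them after damping the raw counts with logarithms. The formula is an assumption chosen for illustration; this part of the text does not specify how the values are combined, and `keyword_document_relevance` and its parameters are hypothetical names.

```python
import math

def keyword_document_relevance(link_importance, keyword_freq, access_count=0):
    """Assumed combination: link importance scaled by damped keyword
    frequency and by a bonus for accesses recorded in the access log."""
    if keyword_freq <= 0:
        # A document that never mentions the keyword is not relevant to it.
        return 0.0
    frequency_part = math.log1p(keyword_freq)
    access_part = 1.0 + math.log1p(access_count)
    return link_importance * frequency_part * access_part
```

Because the score is a product, inflating only one signal (for example, the link importance) has limited effect when the keyword rarely appears or users never follow the link.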

[0027] The document retrieval apparatus may further include document type discriminating means for discriminating the document type of document data using the URL similarity together with the number of outgoing links and the number of incoming links of the document data, and the keyword document relevance calculating means may select document data based on the document type and calculate relevance for the selected document data.

[0028] Document data comes in several types, such as link pages and content pages. These can be distinguished by the number of outgoing links and the number of incoming links. Selecting document data of a certain type, for example content pages, based on this document type and calculating relevance for the selected document data enables more accurate retrieval.

[0029] The document retrieval apparatus may further include index generating means for generating an index that gives access to keywords from the readings or spellings of the extracted keywords. The document retrieval apparatus may further include selecting means by which the user selects fragments of the reading or spelling of a keyword from the index. When generating the index, the index generating means includes in it no more than a fixed number of document data items, in descending order of calculated relevance. The access means accesses the document data based on the selected fragment of the keyword's reading or spelling. Because the document data contained in the index is limited to a fixed number, the user can easily select from the index. This limit is also useful on mobile terminals such as mobile phones, where the display area is limited.
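The cap on index entries can be sketched as a top-N selection per keyword. This is a minimal illustration, assuming relevance scores have already been calculated; `build_index`, `max_docs`, and the nested-dictionary input format are hypothetical.

```python
import heapq

def build_index(relevance, max_docs=3):
    """relevance maps keyword -> {doc_id: score} (assumed input format).
    Returns keyword -> doc_ids, highest relevance first, at most max_docs each."""
    index = {}
    for keyword, docs in relevance.items():
        top = heapq.nlargest(max_docs, docs.items(), key=lambda item: item[1])
        index[keyword] = [doc_id for doc_id, _ in top]
    return index
```

For example, `build_index({"java": {1: 0.2, 2: 0.9, 3: 0.5, 4: 0.7}})` keeps documents 2, 4, and 3 in that order and drops document 1.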

[0030] The document retrieval apparatus may further include collecting means for collecting document data from a network. Further, according to the present invention, a link collection generating system that handles document data having link relations comprises collecting means, link importance assigning means, URL character string discriminating means, and index generating means.

[0031] The collecting means collects document data from the network; the link importance assigning means assigns to the document data a link importance calculated using the link relations of the document data; and the URL character string discriminating means identifies, among the document data, URLs that have particular character-string features. The index generating means generates, based on the link importance and the particular character-string features of the URLs, a link collection in which links to document data are arranged in order, up to a fixed number. The character-string features of a document's URL may indicate the content of that document. For example, the URL of document data about JAVA (registered trademark) may contain a character string such as JAVA or java. The character-string features of a URL can therefore be used to estimate the content of the document data. Hence, creating a link collection to document data based on the link importance and the character-string features of URLs makes it possible to automatically generate a link collection through which the user can find document data containing the content he or she wishes to browse.

[0032] The link collection generating system may further include document type discriminating means. The document type discriminating means discriminates the document type of the document data as described above, and the index generating means selects document data based on the document type and generates the link collection for the selected document data based on the link importance and the particular character-string features of the URLs. This makes it possible to generate a collection of links to higher-quality document data.

[0033] The scope of the present invention also includes a method comprising the processing steps carried out by the above apparatus. Furthermore, the scope of the present invention includes a recording medium on which a program that causes a computer to execute the above processing is recorded.

[0034]

Embodiments of the Invention: Embodiments of the present invention are described below in detail with reference to the drawings. FIG. 1 shows the configuration of a document retrieval apparatus according to a first embodiment. The document retrieval apparatus of FIG. 1 includes a processing device 11, an input device 12, and a display device 13. The processing device 11 includes, for example, a CPU (Central Processing Unit) and memory; the input device 12 corresponds to a keyboard, a mouse, and the like; and the display device 13 corresponds to a display and the like.

[0035] The processing device 11 includes a link importance assigning unit 21, a keyword extractor 22, a keyword document relevance calculator 23, an index generator 24, an index access unit 25, and an access analyzer 26. These correspond, for example, to software components written as programs and are stored in specific program code segments of the processing device 11.

[0036] The link importance assigning unit 21 extracts link information from document data 30 such as WWW pages. In the case of a WWW page, for example, it parses the HTML and takes out anchor (a) tag portions such as <a href="http://www.fujitsu.co.jp/">富士通トップ</a>. It then calculates the link importance 31 based on the extracted link information and outputs the calculated link importance 31 to the keyword document relevance calculator 23. The link importance assigning unit 21 also includes a URL similarity calculator 27, which calculates the URL similarity, that is, the similarity between the surface strings of the link-source and link-destination URLs. The URL similarity calculator 27 calculates the URL similarity between the link source and the link destination, and the link importance assigning unit 21 calculates the link importance 31 based on the extracted link relations and the URL similarity.
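The anchor extraction step might look like the following sketch, which uses Python's standard `html.parser` module. The class and function names (`AnchorExtractor`, `extract_links`) are hypothetical, and the sketch covers only the extraction of link targets and anchor text, not the importance calculation.

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def extract_links(html):
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Applied to the anchor in the paragraph above, `extract_links` yields the pair of the URL `http://www.fujitsu.co.jp/` and the anchor text `富士通トップ`.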

[0037] The keyword extractor 22 extracts keywords from the document data 30 and outputs the result as page keywords 32. The keyword extractor 22 may also tally the total frequency with which each extracted keyword appears in the document data 30. When the document data 30 is written in Japanese, for example, the keyword extractor 22 performs morphological analysis (word segmentation) and extracts nouns (noun sequences) as keywords. Simple notational variants (such as "コンピュータ" and "コンピューター") are unified in advance using rules or small-scale rule sets. Synonym information is supplied externally, for example by a dictionary.

[0038] The keyword document relevance calculator 23 calculates the keyword document relevance, which is the relevance between a keyword and a document, based on the link importance 31, the page keywords 32, and an access log 34 described later, and outputs the result to the index generator 24.

[0039] The index generator 24 generates index data 33 based on the keyword document relevance and outputs the generated index data 33 to the index access unit 25. The index data 33 is generated using, for example, hypertext.

[0040] The index access unit 25 displays the contents of the index data 33 on the display device 13 in accordance with user instructions entered from the input device 12, and outputs information indicating the user's access status to the access analyzer 26.

[0041] The access analyzer 26 analyzes the information indicating the user's access status, creates an access log 34 that tallies the documents accessed from each keyword within a fixed period, and outputs the created access log 34 to the keyword document relevance calculator 23.

【００４２】次に、図２から図５を参照しながら各種デ
ータのデータ構造について説明する。図２は文書情報を
格納するテーブル集を示す。文書情報を格納するテーブ
ル集は、文書情報テーブル４１及び被参照文書テーブル
４２を含む。文書情報テーブル４１は、文書ＩＤ、ＵＲ
Ｌ、タイトル、被参照文書テーブル４２へのリンク及び
文書データのリンク重要度からなる。文書ＩＤは文書デ
ータにユニークにつけられた数値、ＵＲＬはインターネ
ットで文書データを指す情報、タイトルはその文書デー
タのタイトル、被参照文書テーブル４２はその文書デー
タにリンクしているリンク元の文書データの集合を格納
する。被参照文書テーブル４２は文書ＩＤ及びリンク先
とリンク元の文書データのＵＲＬ上の字面の類似度であ
るＵＲＬ類似度の対情報を格納する。１文書毎に高々１
つの被参照文書テーブルがある。文書情報テーブル４１
及び被参照文書テーブル４２は、リンク重要度付与器２
１が作成するリンク重要度３１に相当する。Next, the data structures of the various data will be described with reference to FIGS. 2 to 5. FIG. 2 shows the set of tables that store document information. This set includes a document information table 41 and a referenced document table 42. The document information table 41 consists of a document ID, a URL, a title, a link to the referenced document table 42, and the link importance of the document data. The document ID is a numerical value uniquely assigned to the document data, the URL is information that identifies the document data on the Internet, and the title is the title of the document data. The referenced document table 42 stores the set of link-source document data that link to the document data. Each entry of the referenced document table 42 stores a document ID paired with the URL similarity, that is, the similarity between the character strings of the URLs of the link destination and the link source. There is at most one referenced document table per document. The document information table 41 and the referenced document table 42 correspond to the link importance 31 created by the link importance assigning unit 21.
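
The two tables of FIG. 2 can be pictured as simple records. The following is only a minimal sketch: the class and field names (`DocumentInfo`, `ReferencedEntry`, `referenced_by`, and so on) and the example URL are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReferencedEntry:
    """One row of the referenced document table 42 (names are illustrative)."""
    source_doc_id: int     # document ID of the link-source document
    url_similarity: float  # URL similarity between link source and destination


@dataclass
class DocumentInfo:
    """One row of the document information table 41 (names are illustrative)."""
    doc_id: int
    url: str
    title: str
    link_importance: float = 0.0
    # Stands in for the link to the (at most one) referenced document table.
    referenced_by: List[ReferencedEntry] = field(default_factory=list)


# Hypothetical example: document 1 is linked from document 2.
doc = DocumentInfo(doc_id=1, url="http://www.example.com/a.html", title="A")
doc.referenced_by.append(ReferencedEntry(source_doc_id=2, url_similarity=1.0))
```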

【００４３】図３はキーワード情報を格納するテーブル
集を示す。キーワード情報を格納するテーブル集はキー
ワードテーブル５１、キーワード対応テーブル５２及び
出現文書テーブル５３を含む。キーワードテーブル５１
は、キーワードＩＤ、代表語、出現文書テーブル５３へ
のリンクからなる。代表語とは、同じキーワードＩＤを
持つキーワードが複数ある場合に、どれを代表とするか
を示すの情報である。キーワード対応テーブル５２は、
実際のキーワードと、読み（又はスペル）、キーワード
ＩＤとの対情報からなる。ここで、同一概念を表すキー
ワード（例えば、「コンピュータ」「computer」「計算
機」など）には同一のキーワードＩＤ（ｋｗＩＤ）がふ
られている。また、日本語のキーワードの読みについて
は、読みの中の「―」を除去したり、「ぁ」や「ぃ」の
ような拗撥音を「あ」や「い」としたりして表記を統一
しておく。英語のキーワードについては、大文字に統一
しておく。これにより、「コンピューター」や「コンピ
ュータ」といった表記のゆれによって同じ概念を示すキ
ーワードが異なるように扱われることを防ぎ、生成され
る索引においてキーワードの統一を行うことができる。
出現文書テーブル５３は、当該キーワードを含む文書デ
ータの文書ＩＤとそのキーワードの出現頻度からなる。
キーワード対応テーブル５２及びキーワードテーブルの
代表語は予め処理装置１１に備えられている（不図
示）。出現文書テーブル５３はキーワード抽出器２２に
より作成されるページキーワード３２に相当する。FIG. 3 shows the set of tables that store keyword information. This set includes a keyword table 51, a keyword correspondence table 52, and an appearance document table 53. The keyword table 51 consists of a keyword ID, a representative word, and a link to the appearance document table 53. The representative word indicates which keyword is to be used as the representative when several keywords share the same keyword ID. The keyword correspondence table 52 consists of an actual keyword paired with its reading (or spelling) and keyword ID. Keywords that represent the same concept (for example, 「コンピュータ」, "computer", and 「計算機」) are assigned the same keyword ID (kwID). For the readings of Japanese keywords, the notation is unified by removing the long-vowel mark 「―」 from the reading and by replacing small kana such as 「ぁ」 and 「ぃ」 with 「あ」 and 「い」. English keywords are unified to uppercase. This prevents keywords that denote the same concept from being treated as different because of spelling variations such as 「コンピューター」 and 「コンピュータ」, so that keywords are unified in the generated index. The appearance document table 53 consists of the document IDs of the document data containing the keyword and the frequency of appearance of the keyword in each. The keyword correspondence table 52 and the representative words of the keyword table 51 are provided in the processing device 11 in advance (not shown). The appearance document table 53 corresponds to the page keywords 32 created by the keyword extractor 22.
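
The notation unification described for readings and spellings might look like the sketch below. The function names and the particular set of small kana are assumptions made for illustration; the specification names only 「ぁ」 and 「ぃ」 as examples and does not list the full mapping.

```python
# Small kana mapped to full-size kana (an illustrative subset, not from the source).
SMALL_TO_LARGE = str.maketrans("ぁぃぅぇぉゃゅょ", "あいうえおやゆよ")
# Long-vowel marks removed from readings (both common glyphs are handled).
LONG_VOWEL_MARKS = ("ー", "―")


def normalize_reading(reading: str) -> str:
    """Unify a Japanese reading: drop long-vowel marks, enlarge small kana."""
    for mark in LONG_VOWEL_MARKS:
        reading = reading.replace(mark, "")
    return reading.translate(SMALL_TO_LARGE)


def normalize_spelling(word: str) -> str:
    """Unify an English keyword by converting it to uppercase."""
    return word.upper()


# Two spelling variants of the same word end up with the same reading.
same = normalize_reading("こんぴゅーたー") == normalize_reading("こんぴゅーた")
```

With this normalization the two variants compare equal, which is the property the keyword correspondence table relies on to assign one keyword ID per concept.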

【００４４】図４は索引情報を格納するテーブル集を示
す。索引情報を格納するテーブル集は、索引情報テーブ
ル６１、関連文書テーブル６２及び関連キーワードテー
ブル６３を含む。索引情報テーブル６１は、索引文字
列、継続する文字列の並び、関連キーワード列からな
る。索引情報テーブル６１は、図３のキーワード対応テ
ーブル５２に格納されたキーワードとその読み（又はス
ペル）及びキーワード文書に基づいて、索引データ生成
部２４が後述の方法により文字列グラフを生成し、それ
を縮退することにより作成される。図４では、例えば、
トップが「あ」「い」・・・。「あ」を選ぶと次に「あ
いぼ」，「あお」，・・・が現れることを示している。
また、文字列「あいぼ」から考えられるキーワードは
「相棒」「アイボリー」であることを示している。ここ
でキーワードは、図４のキーワード対応テーブル５２に
含まれるキーワードである。関連文書テーブル６２は、
キーワードＩＤから関連する文書のＩＤである関連文書
ＩＤを得るためのテーブルである。キーワード文書関連
度計算器２３は文書関連度を計算し、その計算結果に基
づいて、文書関連度の高いものから順に一定個数以内の
関連文書ＩＤ列を関連文書テーブル６２に格納する。関
連キーワードテーブル６３は、文書ＩＤから関連キーワ
ードＩＤを得るためのテーブルである。関連文書テーブ
ル６２と関連キーワードテーブル６３の情報は同じで、
それを転置したものである。なお、関連文書ＩＤの詳細
情報は、図２の文書情報テーブル４１に格納されてい
る。FIG. 4 shows the set of tables that store index information. This set includes an index information table 61, a related document table 62, and a related keyword table 63. The index information table 61 consists of an index character string, the list of character strings that continue from it, and a list of related keywords. The index data generation unit 24 creates the index information table 61 by generating a character string graph, by a method described later, from the keywords stored in the keyword correspondence table 52 of FIG. 3, their readings (or spellings), and the keyword documents, and then degenerating the graph. In the example of FIG. 4, the top level is 「あ」, 「い」, and so on; selecting 「あ」 then presents 「あいぼ」, 「あお」, and so on. The keywords reachable from the character string 「あいぼ」 are 「相棒」 and 「アイボリー」. Here, the keywords are those contained in the keyword correspondence table 52. The related document table 62 is a table for obtaining, from a keyword ID, the related document IDs, which are the IDs of the related documents. The keyword document relevance calculator 23 calculates the document relevance and, based on the result, stores up to a fixed number of related document IDs in the related document table 62 in descending order of relevance. The related keyword table 63 is a table for obtaining related keyword IDs from a document ID. The related document table 62 and the related keyword table 63 hold the same information; one is the transpose of the other. Detailed information for each related document ID is stored in the document information table 41 of FIG. 2.

【００４５】図５は、アクセスログ情報を格納するテー
ブルであるアクセスログを示す。アクセスログ７１は、
利用者がキーワード情報画面（後述）から文書データを
選んだ際に関する情報、つまり、アクセス日時、キーワ
ードＩＤ及び選ばれた文書データの文書ＩＤの対情報等
を格納する。アクセスログ７１は、アクセス解析器２６
が作成するアクセスログ３４に相当する。一定期間内の
ログを集計することにより、特定キーワードから特定文
書がどのくらいアクセスされたかを得ることができる。FIG. 5 shows the access log, a table that stores access log information. The access log 71 stores information recorded when a user selects document data from a keyword information screen (described later), namely the access date and time paired with the keyword ID and the document ID of the selected document data. The access log 71 corresponds to the access log 34 created by the access analyzer 26. By tallying the log entries within a certain period, it is possible to determine how often a specific document was accessed from a specific keyword.
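
Tallying the log over a period can be sketched as follows. The record layout and the names `AccessRecord` and `count_accesses` are hypothetical, and the half-open interval is an assumption; the specification says only that entries within a certain period are counted.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, NamedTuple


class AccessRecord(NamedTuple):
    """One access log entry (field names are illustrative)."""
    accessed_at: datetime
    keyword_id: int
    doc_id: int


def count_accesses(log: Iterable[AccessRecord],
                   start: datetime, end: datetime) -> Counter:
    """Count accesses per (keyword ID, document ID) with start <= time < end."""
    return Counter(
        (r.keyword_id, r.doc_id) for r in log if start <= r.accessed_at < end
    )


# Hypothetical log: keyword 10 led to document 100 twice in January.
log = [
    AccessRecord(datetime(2000, 1, 5), 10, 100),
    AccessRecord(datetime(2000, 1, 9), 10, 100),
    AccessRecord(datetime(2000, 2, 1), 10, 101),
]
counts = count_accesses(log, datetime(2000, 1, 1), datetime(2000, 2, 1))
```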

【００４６】以下、文書検索装置全体の動作について図
６を用いて説明する。図６は索引生成処理の手順を示
す。まず、リンク重要度付与器２１は、文書データから
リンク情報及びＵＲＬ等を抽出する。続いて、抽出され
た情報を文書情報テーブル４１の文書ＩＤ、ＵＲＬフィ
ールドに格納し、被参照文書テーブル４２へのリンク
（ポインタ）及び被参照文書テーブル４２を生成する
（ステップＳ１）。The operation of the document retrieval apparatus as a whole will now be described with reference to FIG. 6, which shows the procedure of the index generation process. First, the link importance assigning unit 21 extracts link information, URLs, and the like from the document data. It then stores the extracted information in the document ID and URL fields of the document information table 41, and generates the link (pointer) to the referenced document table 42 as well as the referenced document table 42 itself (step S1).

【００４７】リンク重要度付与器２１に備えられたＵＲ
Ｌ類似度計算器２７は、抽出されたリンク情報及びＵＲ
Ｌに基づいて、リンク元とリンク先のＵＲＬ類似度を計
算し、被参照文書テーブル４２のフィールドに格納す
る。The URL similarity calculator 27 provided in the link importance assigning unit 21 calculates the URL similarity between each link source and link destination from the extracted link information and URLs, and stores it in the corresponding field of the referenced document table 42.

【００４８】次に、リンク重要度付与器２１は抽出され
たリンク情報及びＵＲＬ類似度に基づいてリンク重要度
を算出し、文書情報テーブル４１のフィールドに格納す
る（ステップＳ２）。ＵＲＬ類似度及びリンク重要度の
算出については後述する。Next, the link importance assigning unit 21 calculates the link importance based on the extracted link information and the URL similarity, and stores it in the field of the document information table 41 (step S2). The calculation of the URL similarity and the link importance will be described later.

【００４９】キーワード抽出器２２は、文書データから
キーワードを抽出し、キーワード対応テーブル５２のキ
ーワードとキーワードＩＤのフィールド、キーワードテ
ーブル５１の全フィールド及び出現文書テーブル５３の
文書ＩＤと頻度のフィールドに格納する（ステップＳ
３）。キーワードの抽出は、例えば、文書データ３０が
日本語で記述されている場合に、キーワード抽出器２２
は形態素解析（単語切り）を行い、単語切りによって得
られた名詞（列）から行う。また、簡単な表記の揺れ
（「コンピュータ」と「コンピューター」など）はルー
ルや小規模な辞書で統一しておく。同義語の情報は外付
けで辞書などにより与えられる。The keyword extractor 22 extracts keywords from the document data and stores them in the keyword and keyword ID fields of the keyword correspondence table 52, in all fields of the keyword table 51, and in the document ID and frequency fields of the appearance document table 53 (step S3). For example, when the document data 30 is written in Japanese, the keyword extractor 22 performs morphological analysis (word segmentation) and extracts keywords from the resulting nouns (or noun sequences). Simple spelling variations (such as 「コンピュータ」 and 「コンピューター」) are unified by rules or a small dictionary. Synonym information is supplied externally, for example by a dictionary.

【００５０】次にキーワード抽出器２２は、抽出された
キーワードの読み付けを上述のようにして統一された表
記法則に基づいて行い、その結果をキーワード対応テー
ブル５２の読みフィールド（又はスペル）に格納する
（ステップＳ４）。キーワード対応テーブル５２ではキ
ーワードの表記が統一されて格納されているため、これ
により、生成される索引のキーワードを統一することが
できる。Next, the keyword extractor 22 assigns readings to the extracted keywords according to the unified notation rules described above, and stores the results in the reading (or spelling) field of the keyword correspondence table 52 (step S4). Because keyword notation is stored in unified form in the keyword correspondence table 52, the keywords of the generated index can be unified.

【００５１】次にキーワード抽出器２２は、抽出された
キーワードが文書データ中で出現する全出現頻度を集計
し、その結果に基づいてキーワードテーブル５１の出現
文書のポインタを生成し、出現文書テーブル５３の文書
ＩＤ及び頻度のフィールドに集計された頻度を格納する
（ステップＳ５）。さらに、キーワード抽出器２２は、
キーワードＩＤの全出現頻度を集計し、上位一定個数
（例えば10,000語）のキーワードを索引の対象キーワー
ドとして決定し、当該キーワードＩＤ以外のエントリ
を、キーワードテーブル５１及びキーワード対応テーブ
ル５２から除く。Next, the keyword extractor 22 tallies the total frequency with which each extracted keyword appears in the document data, generates the pointer to the appearance documents in the keyword table 51 based on the result, and stores the tallied frequencies in the document ID and frequency fields of the appearance document table 53 (step S5). The keyword extractor 22 further tallies the total frequency of appearance for each keyword ID, selects the top fixed number of keywords (for example, 10,000) as the keywords to be indexed, and removes the entries for all other keyword IDs from the keyword table 51 and the keyword correspondence table 52.

【００５２】次に、キーワード文書関連度計算器２３
は、文書情報テーブル４１のリンク重要度、被参照文書
テーブル４２のＵＲＬ類似度及びアクセスログ７１に基
づいてキーワードと文書の関連度を示すキーワード文書
関連度を計算し、キーワード文書関連度が上位から一定
個数にある文書を関連文書として決定し、その決定に基
づいて関連文書テーブル６２及び関連キーワードテーブ
ル６３の関連文書ＩＤ列のフィールドに情報を格納する
（ステップＳ６）。Next, the keyword document relevance calculator 23 calculates the keyword document relevance, which indicates how strongly a keyword and a document are related, from the link importance in the document information table 41, the URL similarity in the referenced document table 42, and the access log 71. It selects the documents ranked within a fixed number from the top by keyword document relevance as the related documents, and stores the corresponding information in the related document ID fields of the related document table 62 and the related keyword table 63 (step S6).

【００５３】次に、索引生成器２４は、キーワード対応
テーブル５２のエントリキーワードとその読み（又はス
ペル）に基づいて、文字列グラフを生成し、それを縮退
し、その結果を索引情報テーブル６１に格納する（ステ
ップＳ７）。縮退については後述する。Next, the index generator 24 generates a character string graph from the entry keywords of the keyword correspondence table 52 and their readings (or spellings), degenerates the graph, and stores the result in the index information table 61 (step S7). Degeneration is described later.

【００５４】次に、索引生成器２４は、索引情報テーブ
ル、関連文書テーブル、関連キーワードテーブル及び文
書情報テーブルに基づいて索引を作成する（ステップＳ
８）。索引は例えばハイパーテキストで作成される。作
成された索引は、表示装置１３に表示されてもよい。Next, the index generator 24 creates an index based on the index information table, the related document table, the related keyword table, and the document information table (step S8). The index is created, for example, as hypertext. The created index may be displayed on the display device 13.

【００５５】生成された索引は索引アクセス部２５を介
して表示装置１３に出力され、利用者は表示された索引
を用いて入力をする。索引アクセス部２５は利用者のア
クセス状況を示す情報をアクセス解析器２６に出力す
る。アクセス解析器２６はアクセス状況を示す情報を解
析してアクセスログ３４を作成する（不図示）。The generated index is output to the display device 13 via the index access unit 25, and the user inputs using the displayed index. The index access unit 25 outputs information indicating the access status of the user to the access analyzer 26. The access analyzer 26 analyzes information indicating the access status and creates an access log 34 (not shown).

【００５６】以下、文書検索装置のリンク重要度付与器
がリンク重要度を算出する手順について説明する。ま
ず、本実施形態において、リンク重要度付与器２１はリ
ンク重要度を付与する際に、その文書データのリンク関
係、ＵＲＬ及びキーワードを利用する。なお、リンク関
係に基づいて判定される文書データの重要度をリンク重
要度という。リンク重要度を判定する際の基本的な考え
方は以下の通りである。１．類似度の低いＵＲＬから多くリンクされている文書
データ（ページ）は重要である。The procedure by which the link importance assigning unit of the document retrieval apparatus calculates the link importance is described below. In the present embodiment, the link importance assigning unit 21 uses the link relations, URLs, and keywords of the document data when assigning link importance. The importance of document data determined from link relations is called the link importance. The basic ideas behind determining the link importance are as follows. 1. Document data (pages) that are linked from many URLs with low similarity are important.

【００５７】例えば、一般に、同一サイト内に設けられ
た複数のＷＷＷページ（ページ）はそのサイト内の他の
ページにリンクされているが、それらのページのＵＲＬ
は相互に類似する。従って、類似度の高いＵＲＬからリ
ンクされているページの重要度は低いと推定できるから
である。２．多くのページからリンクされているページほど重要
なページであり、重要なページからリンクされている、
ＵＲＬの類似度の低いページは重要である。For example, the WWW pages of a single site generally link to other pages in that site, and the URLs of those pages are similar to one another. A page that is linked from URLs with high similarity can therefore be presumed to be of low importance. 2. The more pages link to a page, the more important it is, and a page with low URL similarity that is linked from an important page is important.

【００５８】例えば、有名なディレクトリサービス等及
び官公庁等は多くのページからリンクされているが、こ
のような重要なページからリンクされている文書の方
が、個人が開設するページやそのコンテンツのエントリ
ページからリンクされている文書よりも重要度が高いと
考えられるからである。また、多くのページやミラーサ
イトを抱えるサービス（サイト）に設けられたページ等
はそのサイト内でリンクされていることが多い。そのた
め、従来技術において、同じサイトのページが多く検索
されてしまうという問題があった。しかし、そのサイト
内のページのＵＲＬは例えばドメインが同じ等大抵類似
しているため、「ＵＲＬの類似度の低いページは重要で
ある」という考え方を導入すれば、この問題を解消する
ことが可能となる。３．ＵＲＬの類似度は、サーバアドレス、パス、ファイ
ル名の全てが異なるものが最も小さく、ミラーサイトや
同一サーバ内のページは類似度が高くなるように、ＵＲ
Ｌの字面情報から定義する。For example, well-known directory services and government offices are linked from many pages, and a document linked from such an important page is considered more important than a document linked from a page opened by an individual or from the entry page of its contents. In addition, the pages of a service (site) that holds many pages or mirror sites are often linked within that site. The conventional technology therefore had the problem that many pages from the same site were retrieved. Because the URLs of pages within a site are usually similar, for example sharing the same domain, introducing the idea that "pages with low URL similarity are important" makes it possible to resolve this problem. 3. The URL similarity is defined from the character strings of the URLs so that it is lowest when the server address, path, and file name all differ, and higher for mirror sites and for pages on the same server.

【００５９】上述の３つの考え方を導入することによ
り、全てのリンク関係を同等に扱わないでリンク重要度
に応じた重みをリンク関係に与えることとしている。よ
り具体的には、リンクの重みをリンク元とリンク先文書
のＵＲＬの類似度の逆数として与えることとしている。
これにより、単純にリンクされている数だけで文書の重
要度を判定しすべてのリンクを同等に扱っている従来技
術における問題、つまり、大量のページやミラーサイト
を抱えるサーバ（サイト）や個人が単に量が多いという
だけで重要度が高いことになるという問題を解決するこ
とが可能となる。また、故意にサイト内のページを増や
してリンクを設定することによりページの重要度を上げ
ようとしても、同じサイトのページのＵＲＬ類似度は高
いために、ページの重要度を上げることは従来よりも困
難になるという効果もある。By introducing the three ideas above, link relations are not all treated equally; instead, each link relation is given a weight that reflects the link importance. More specifically, the weight of a link is given as the reciprocal of the similarity between the URLs of the link-source and link-destination documents. This resolves a problem of the conventional technology, which judges the importance of a document simply by the number of links to it and treats all links equally, namely that a server (site) or an individual holding a large number of pages or mirror sites is rated as highly important merely because of that volume. A further effect is that even if someone deliberately adds pages within a site and links them in order to raise a page's importance, the URL similarity among pages of the same site is high, so raising the importance in this way is harder than before.

【００６０】以下、リンク重要度付与器２１におけるリ
ンク重要度の算出についてより詳しく説明する。リンク
重要度の算出対象となるページ集合をＤＯＣ＝｛p₁，
p₂,....p_N｝、ページｐのリンク重要度をＷ_p、ページ
ｐのリンク先のページ集合をＲef(p) 、ページｐのリン
ク元のページ集合をＲefed(p) 、ページｐとｑのＵＲＬ
類似度をsim(p,q)、相異度をdiff(p,q) ＝1/sim(p,q)と
すると、ページｐからｑにリンクが張られているとした
時、そのリンクの重みlw(p,q) を以下の（１）式で定義
する。The calculation of the link importance in the link importance assigning unit 21 is described in more detail below. Let DOC = {p_1, p_2, ..., p_N} be the set of pages for which the link importance is calculated, W_p the link importance of page p, Ref(p) the set of pages that page p links to, Refed(p) the set of pages that link to page p, sim(p,q) the URL similarity of pages p and q, and diff(p,q) = 1/sim(p,q) their dissimilarity. When a link exists from page p to page q, the weight lw(p,q) of that link is defined by the following equation (1).

【００６１】[0061]

【数３】 (Equation 3)
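
Equation (1) appears in the publication only as an image. One form that agrees with the definitions in [0060], with the remark in [0062] that lw(p,q) grows as sim(p,q) falls and as p has fewer outgoing links, and with the computations of lw(p1,q) and lw(p2,q) in [0069] would be the following. It is a reconstruction under those assumptions and should not be read as the equation as filed.

```latex
lw(p,q) \;=\; \frac{\mathrm{diff}(p,q)}{\sum_{i \in \mathrm{Ref}(p)} \mathrm{diff}(p,i)}
        \;=\; \frac{1}{\mathrm{sim}(p,q)\,\sum_{i \in \mathrm{Ref}(p)} \dfrac{1}{\mathrm{sim}(p,i)}}
```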

【００６２】この（１）式から分かるように、lw(p,q)
は、ｐとｑのＵＲＬの類似度sim(p,q)が低いほど、ま
た、ｐからのリンク数がより少ないほど大きくなる。各
ページのリンク重要度は、各ｐ∈ＤＯＣに対して、Ｃ_q
を定数（重要度の下限であり、ページによって異なる値
を与えてもよい。）として、As can be seen from equation (1), lw(p,q) becomes larger as the URL similarity sim(p,q) of p and q becomes lower and as the number of links from p becomes smaller. With C_q a constant (the lower bound of the importance; a different value may be given to each page), the link importance of each page is defined, for each p ∈ DOC,

【００６３】[0063]

【数４】 (Equation 4)
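
Equation (2) also appears in the publication only as an image. A form consistent with the expansion of W_q given in [0070], where W_q is C_q plus the weighted importance of each page linking to q, would be the following; the summation notation is a reconstruction and is an assumption.

```latex
W_q \;=\; C_q \;+\; \sum_{p \in \mathrm{Refed}(q)} lw(p,q)\, W_p \qquad (q \in \mathrm{DOC})
```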

【００６４】という連立一次方程式の解として定義す
る。リンク重要度付与器２１は、この連立一次方程式を
解くことにより、リンク重要度を各ページに付与する。
なお、このような連立一次方程式の解法については、既
存のアルゴリズムが多数存在するため、説明は省略す
る。（１）式中のＵＲＬ類似度sim(p,q)はリンク重要度
付与器２１に備えられたＵＲＬ類似度計算器２７によ
り、計算される（後述）。（１）式及び（２）式から、
上述の考え方が実現されていることを読み取ることがで
きる。すなわち、（１）式から類似度が低ければ、重み
ｌｗは大となるから（２）式から類似度の低いＵＲＬか
ら多くリンクされている文書ページは重要となる。ま
た、（２）式から多くのページからリンクされているペ
ージほど重要なページとなる。as the solution of this system of simultaneous linear equations. The link importance assigning unit 21 assigns a link importance to each page by solving the system. Many existing algorithms are available for solving such simultaneous linear equations, so their description is omitted. The URL similarity sim(p,q) in equation (1) is calculated by the URL similarity calculator 27 provided in the link importance assigning unit 21 (described later). Equations (1) and (2) show that the ideas described above are realized. That is, by equation (1) a lower similarity gives a larger weight lw, so by equation (2) a document page linked from many URLs with low similarity becomes important. Equation (2) also shows that the more pages link to a page, the more important it becomes.
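
As a rough illustration of how such a system might be solved, the sketch below uses plain fixed-point iteration. Several things here are assumptions rather than the patented method: the link weight is taken as lw(p,q) = diff(p,q) / Σ diff(p,i) over i in Ref(p), the importance as W_q = C_q + Σ lw(p,q)·W_p over pages p linking to q, every C_q is set to the same constant, and the iteration is only expected to settle on link graphs without cycles (the function raises otherwise). All function and variable names are hypothetical.

```python
from typing import Callable, Dict, Iterable, Set, Tuple


def link_weights(links: Dict[str, Set[str]],
                 sim: Callable[[str, str], float]) -> Dict[Tuple[str, str], float]:
    """Assumed form: lw(p, q) = diff(p, q) / sum of diff(p, i), diff = 1 / sim."""
    weights = {}
    for p, targets in links.items():
        total = sum(1.0 / sim(p, i) for i in targets)
        for q in targets:
            weights[(p, q)] = (1.0 / sim(p, q)) / total
    return weights


def link_importance(pages: Iterable[str], links: Dict[str, Set[str]],
                    sim: Callable[[str, str], float], c: float = 1.0,
                    max_iter: int = 1000, tol: float = 1e-12) -> Dict[str, float]:
    """Iterate W_q = c + sum of lw(p, q) * W_p until the values stop changing."""
    pages = list(pages)
    lw = link_weights(links, sim)
    w = {page: c for page in pages}
    for _ in range(max_iter):
        new_w = {page: c for page in pages}
        for (p, q), weight in lw.items():
            new_w[q] += weight * w[p]
        if max(abs(new_w[k] - w[k]) for k in pages) < tol:
            return new_w
        w = new_w
    raise RuntimeError("iteration did not converge")


# Part of the FIG. 7 setting: p1 links to q, r1, r2; p2 and p3 link only to q.
SIM = {("p2", "q"): 2.0}  # every other pair has similarity 1


def sim(p: str, q: str) -> float:
    return SIM.get((p, q), 1.0)


links = {"p1": {"q", "r1", "r2"}, "p2": {"q"}, "p3": {"q"}}
pages = ["p1", "p2", "p3", "q", "r1", "r2"]
weights = link_weights(links, sim)
W = link_importance(pages, links, sim)
```

In this reduced example the weights into q come out as 1/3, 1, and 1, matching the first three values listed in [0069], and W["q"] equals C_q + W_p1/3 + W_p2 + W_p3 with every constant set to 1.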

【００６５】さらに、（２）式から重要なページ（Ｗ
ｑ）からリンクされているＵＲＬの類似度の低いページ
（リンクの重みｌｗの高い）は重要である。以下、図７
及び図８を用いて、（１）式及び（２）式の示す考え方
についてより詳しく説明する。Furthermore, equation (2) shows that a page with low URL similarity (a high link weight lw) that is linked from an important page (W_q) is important. The ideas expressed by equations (1) and (2) are described in more detail below with reference to FIGS. 7 and 8.

【００６６】図７は、（１）式及び（２）式の示す内容
を模式的に示す。図７において、まるは各ページ、矢印
はリンク関係、矢印の方向はリンクの方向、矢印の太さ
がリンクの重みを示す。図７に示すように、ページ
ｐ₁、ｐ₂及びｐ₃からページｑへリンクがはられてい
る。ページｐ₁はページｑ以外の２つのページｒ₁及び
ｒ ₂にもリンクし、ページｐ₃は他の２つのページｓ₁
及びｓ₂からリンクされている。FIG. 7 schematically shows what equations (1) and (2) express. In FIG. 7, circles denote pages, arrows denote link relations, the direction of an arrow is the direction of the link, and the thickness of an arrow indicates the link weight. As shown in FIG. 7, pages p₁, p₂, and p₃ link to page q. Page p₁ also links to two pages other than q, namely r₁ and r₂, and page p₃ is linked from two other pages, s₁ and s₂.

【００６７】ここで、各ページのＵＲＬ類似度は、 sim(p1,q) =sim(p1,r1)=sim(p2,r1)= １ sim(p2,q) = ２、（つまり、ページｐ₂及びｑのＵＲＬ
は若干類似する） sim(p3,q) = １、sim(s1,p3)=sim(s2,p3)=３（つまり、
ページｓ₁、ｓ₂及びｐ₃のＵＲＬは類似する）とする。Here, the URL similarities of the pages are assumed to be: sim(p1,q) = sim(p1,r1) = sim(p2,r1) = 1; sim(p2,q) = 2 (that is, the URLs of pages p₂ and q are slightly similar); sim(p3,q) = 1; and sim(s1,p3) = sim(s2,p3) = 3 (that is, the URLs of pages s₁, s₂, and p₃ are similar).

【００６８】（１）式及び（２）式を図７のような場合
に適用すると、各ページｐ₁、ｐ₂、ｐ₃、ｓ₁及びｓ
₂のリンクの重みは、以下のようになる。Applying equations (1) and (2) to the case shown in FIG. 7 gives the following link weights for pages p₁, p₂, p₃, s₁, and s₂.

【００６９】 lw(p1,q)＝１／｛１×（１＋１＋１）｝＝１／３ lw(p2,q)＝１／｛２×（１／２）｝＝１ lw(p3,q)＝１ lw(s1,p3) ＝lw(s2,p3) ＝１／３故に、（１）式及び上述の計算からリンク先の多いペー
ジｐ₁のリンクの重みlw(p1,q)は小さくなることが分か
る。また、同様に（１）式及び上述の計算からＵＲＬ類
似度が大きいほどリンクの重みは小さくなることも分か
る。lw(p1,q) = 1/{1 × (1 + 1 + 1)} = 1/3; lw(p2,q) = 1/{2 × (1/2)} = 1; lw(p3,q) = 1; lw(s1,p3) = lw(s2,p3) = 1/3. From equation (1) and these calculations, the link weight lw(p1,q) of page p₁, which has many link destinations, is small. Likewise, equation (1) and these calculations show that the greater the URL similarity, the smaller the link weight.

【００７０】また、ページｑのリンク重要度Ｗ_qは以下
のようになる。Ｗ_q＝Ｃ_q＋｛lw(p1,q)×Ｗ_p1＋lw(p2,q)×Ｗ_p2＋lw(p3,q)×Ｗ_p3｝＝Ｃ_q＋｛（Ｗ_p1／３）＋Ｗ_p2＋Ｗ_p3｝Ｗ_p1＝Ｃ_p1 Ｗ_p2＝Ｃ_p2 Ｗ_p3＝Ｃ_p3＋｛lw(s1,p3) ×Ｗ_s1＋lw(s2,p3) ×Ｗ_s2｝＝Ｃ_p3＋（Ｗ_s1＋Ｗ_s2）／３故に、ページｐ₁及びｐ₂と比べてより多くのページか
らリンクされているページｐ₃のリンク重要度Ｗ_p3は大
きくなっている。また、ページｑのリンク重要度Ｗ_qを
みれば、重要なページと思われるから、ＵＲＬ類似度の
低いページ、すなわちリンク重みlw(p3,q)＝１、いいか
えるとリンク重み（リンク重みの最大値は＝）が大きい
ページほど、リンク重要度が大きくなることが分かる。
また、（２）式及び上述の計算から、ＵＲＬの類似して
いる同じサイト内のページからのリンクの重みを、他の
ＵＲＬの類似していないページからのリンクの重みより
も軽く扱うことが分かる。これにより、大量のページを
抱えるサイトのページが検索結果の中に多く出てくると
いう問題を解決できていることが分かる。The link importance W_q of page q is as follows. W_q = C_q + {lw(p1,q) × W_p1 + lw(p2,q) × W_p2 + lw(p3,q) × W_p3} = C_q + {(W_p1/3) + W_p2 + W_p3}; W_p1 = C_p1; W_p2 = C_p2; W_p3 = C_p3 + {lw(s1,p3) × W_s1 + lw(s2,p3) × W_s2} = C_p3 + (W_s1 + W_s2)/3. The link importance W_p3 of page p₃, which is linked from more pages than p₁ and p₂, is therefore larger. Looking at the link importance W_q of page q, a link from a page that is considered important and has low URL similarity, that is, a link with weight lw(p3,q) = 1, or in other words a larger link weight (the maximum value of the link weight is ＝), contributes more to the link importance. Equation (2) and the above calculations also show that links from pages in the same site, whose URLs are similar, are weighted more lightly than links from other pages whose URLs are dissimilar. This resolves the problem that pages of a site holding a large number of pages appear heavily in the search results.

【００７１】図８は、（１）式及び（２）式の示す内容
を模式的に示す。図８においても、図７と同様に、まる
は各ページ、矢印はリンク関係、矢印の方向はリンクの
方向、矢印の太さがリンクの重みを示す。さらに、図８
において網掛されているまるはＵＲＬ類似度が高いペー
ジを示す。図８（ａ）及び８（ｂ）において、ページｑ
はページｐ₁、ｐ₂及びｐ₃からリンクされている。さ
らに、図８（ｂ）において、ページｑはページｐ₁、ｐ
₂及びｐ₃のＵＲＬは類似しており、ＵＲＬ類似度sim
(pi,q) ＝ｎ＋１である（ｎは整数）。図８（ａ）及び
図８（ｂ）それぞれについて（１）式及び（２）式を適
用する。図８（ａ）の場合は以下のようになる。FIG. 8 schematically shows what equations (1) and (2) express. As in FIG. 7, circles denote pages, arrows denote link relations, the direction of an arrow is the direction of the link, and the thickness of an arrow indicates the link weight. In addition, the shaded circles in FIG. 8 denote pages whose URL similarity is high. In both FIG. 8(a) and FIG. 8(b), page q is linked from pages p₁, p₂, and p₃. In FIG. 8(b), moreover, the URLs of page q and pages p₁, p₂, and p₃ are similar, with URL similarity sim(pi,q) = n + 1 (n is an integer). Equations (1) and (2) are applied to FIG. 8(a) and FIG. 8(b) in turn. The case of FIG. 8(a) is as follows.

【００７２】各ページのリンクの重み lw(pi,q)＝１／sim(pi,q) ＝１（ＵＲＬが非類似）ページｑのリンク重要度Ｗ_q Ｗ_q＝Ｃ_q＋（Ｗ_p1＋Ｗ_p2＋Ｗ_p3）図８（ｂ）の場合は以下のようになる。Link weight of each page: lw(pi,q) = 1/sim(pi,q) = 1 (the URLs are dissimilar). Link importance of page q: W_q = C_q + (W_p1 + W_p2 + W_p3). The case of FIG. 8(b) is as follows.

【００７３】各ページのリンクの重み lw(pi,q)＝１／sim(pi,q) ＝１／（ｎ＋１）（ＵＲＬ
が類似）ページｑのリンク重要度Ｗ_q Ｗ_q＝Ｃ_q＋（Ｗ_p1＋Ｗ_p2＋Ｗ_p3）／（ｎ＋１）故に、図８（ａ）及び８（ｂ）それぞれの計算結果を比
較すると、ＵＲＬ類似度sim(p,q)が高い場合には、被リ
ンク数が多くてもページｑのリンク重要度Ｗ_qが小さく
なることが分かる。延いては、ＵＲＬ類似度を導入する
ことにより、大量のページを抱えるサーバ（サイト）等
が単にページの量が多いというだけで重要度が高いこと
になるという問題を解決していることが分かる。Link weight of each page: lw(pi,q) = 1/sim(pi,q) = 1/(n + 1) (the URLs are similar). Link importance of page q: W_q = C_q + (W_p1 + W_p2 + W_p3)/(n + 1). Comparing the results for FIG. 8(a) and FIG. 8(b) shows that when the URL similarity sim(p,q) is high, the link importance W_q of page q is small even if many pages link to it. Introducing the URL similarity thus resolves the problem that a server (site) holding a large number of pages is rated as highly important merely because it has many pages.

【００７４】次に、（１）式及び（２）式中のページｐ
とｑのＵＲＬ類似度sim(p,q)について説明する。ＵＲＬ
類似度は、リンク重要度付与器２１に備えられたＵＲＬ
類似度計算器２７により算出される。Next, the URL similarity sim(p,q) of pages p and q in equations (1) and (2) will be described. The URL similarity is calculated by the URL similarity calculator 27 provided in the link importance assigning unit 21.

【００７５】一般に、ページのＵＲＬは、サーバアドレ
ス、パス、ファイル名の三種類の情報から構成される。
例えば、ＷＷＷページのＵＲＬ、 http://www.flab.fujitsu.co.jp/hypertext/news/1999/
product1.html は、サーバアドレス（www.flab.fujits
u.co.jp）、パス（hypertext/news/1999 ）、ファイル
名（product1.html）の３種類の情報から構成される。In general, the URL of a page consists of three kinds of information: a server address, a path, and a file name. For example, the WWW page URL http://www.flab.fujitsu.co.jp/hypertext/news/1999/product1.html consists of the server address (www.flab.fujitsu.co.jp), the path (hypertext/news/1999), and the file name (product1.html).

【００７６】また、サーバアドレスは、さらに“."によ
り階層化されており、後ろに行くにしたがって、段々広
くなる。例えば、サーバアドレスがwww.flab.fujitsu.c
o.jpであれば、後ろから、日本(jp)、会社(co)、富士通
(fujitsu) 、研究所(flab)、マシン(www) という階層を
表している。The server address is further divided into a hierarchy by ".", and the scope becomes progressively broader toward the end. For example, the server address www.flab.fujitsu.co.jp represents, from the end, the hierarchy Japan (jp), company (co), Fujitsu (fujitsu), laboratory (flab), and machine (www).

【００７７】本実施形態では、与えられた２つのページ
ｐ及びｑのＵＲＬ類似度を、上記の三種類の組合せによ
り定義する。類似度sim(p,q)として、例えば、以下に述
べるドメイン類似度sim ＿domain(p,q) 及び融合類似度
sim ＿merge(p,q)が考えられる。In the present embodiment, the URL similarity of two given pages p and q is defined by combining the three kinds of information above. Possible choices for the similarity sim(p,q) include, for example, the domain similarity sim_domain(p,q) and the merged similarity sim_merge(p,q) described below.

【００７８】ドメイン類似度sim ＿domain(p,q) は、ド
メインの類似に基づいて算出される。ドメインとは、サ
ーバアドレスの後半部分であり、会社や組織を表す。サ
ーバアドレスが.com、.edu、.org等で終わる米国サーバ
の場合はサーバアドレスの後ろから２つめまで、サーバ
アドレスが.jp 、.fr 等で終わる他国のサーバの場合は
サーバアドレスの後ろから３つめまでがドメインに相当
する。例えば、www.fujitsu.com のドメインはfujitsu.
com であり、www.flab.fujitsu.co.jpのドメインはfuji
tsu.co.jp である。The domain similarity sim_domain(p,q) is calculated from whether the domains are alike. The domain is the latter part of the server address and represents a company or organization. For US servers whose address ends in .com, .edu, .org, or the like, the domain is the last two components of the server address; for servers in other countries whose address ends in .jp, .fr, or the like, it is the last three components. For example, the domain of www.fujitsu.com is fujitsu.com, and the domain of www.flab.fujitsu.co.jp is fujitsu.co.jp.
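
The domain rule above can be sketched as a short function. The name `domain_of` and the set `GENERIC_TLDS` are hypothetical; the set holds only the three endings named in the text, whereas the specification says ".com, .edu, .org, or the like" without giving a complete list.

```python
# Endings treated as two-component domains (illustrative subset from the text).
GENERIC_TLDS = {"com", "edu", "org"}


def domain_of(server_address: str) -> str:
    """Return the last two components for generic endings, otherwise the last three."""
    labels = server_address.lower().split(".")
    depth = 2 if labels[-1] in GENERIC_TLDS else 3
    return ".".join(labels[-depth:])


us_domain = domain_of("www.fujitsu.com")
jp_domain = domain_of("www.flab.fujitsu.co.jp")
```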

【００７９】ページｐとページｑのドメイン類似度は以
下の（３）式により定義される。 sim ＿domain(p,q) = １／α （ｐ、ｑが同一ドメインの場合） = １（ｐ、ｑが異なるドメインの場合）・・・・（３）ここで、αは定数で、０より大きく１より小さい実数値
を取るとする。図９はインターネットから収集した約３
００万ＵＲＬ間のリンク関係にドメイン類似度sim ＿do
main(p,q) の概念を導入してリンク重要度を計算した場
合を示す。図９において、横軸が、リンク重要度が高い
順から上位何ページかを示す数、縦軸が上位ページの中
に含まれる異なるドメインを持つページの数を示し、系
列１から５は、それぞれ順にα＝０．１、０．２、０．
３、０．５、０．７及び１．０の場合を示す。図９に示
すように、リンク重要度が高い上位１０万件のページ中
に含まれる異なるドメインを持つページの数は、α＝１
の場合（ＵＲＬ類似度を導入しない従来の場合に相当す
る）は４０００件であり、α＝０．１の場合は５５００
件である。従って、αが小さくなるほど、異なるドメイ
ンを持つページがリンク重要度の上位に上がってくるこ
とが分かる。これは、αが小さいほど類似度sim ＿doma
in(p,q) が大きくなり、ＵＲＬ類似度sim ＿domain(p,
q) が大きいほど、リンクの重みlw(p,q) が小さくな
り、延いては、リンク重要度Ｗ_qが小さくなるため、Ｕ
ＲＬ類似度が大きくなるほどページには小さいリンク重
要度が付与されるからである。つまり、sim ＿domain
(p,q) の概念を導入することにより、異なるドメインを
持つページが検索されやすくなっている、言い換える
と、同じドメインを持つページは検索されにくくなって
いることが分かる。The domain similarity of page p and page q is defined by the following equation (3): sim_domain(p,q) = 1/α (when p and q are in the same domain); = 1 (when p and q are in different domains) ... (3). Here, α is a constant that takes a real value greater than 0 and less than 1. FIG. 9 shows the result of calculating the link importance with the domain similarity sim_domain(p,q) applied to the link relations among about 3 million URLs collected from the Internet. In FIG. 9, the horizontal axis is the number of top-ranked pages taken in descending order of link importance, the vertical axis is the number of pages with distinct domains among those top pages, and series 1 to 5 correspond in order to α = 0.1, 0.2, 0.3, 0.5, 0.7, and 1.0. As FIG. 9 shows, the number of pages with distinct domains among the top 100,000 pages by link importance is 4,000 for α = 1 (which corresponds to the conventional case without URL similarity) and 5,500 for α = 0.1. Thus, the smaller α is, the more pages with distinct domains rise toward the top of the link importance ranking. This is because a smaller α gives a larger sim_domain(p,q) for pages in the same domain; a larger URL similarity gives a smaller link weight lw(p,q) and hence a smaller link importance W_q, so that pages with higher URL similarity are assigned smaller link importance. In other words, introducing sim_domain(p,q) makes pages with different domains easier to retrieve and pages sharing a domain harder to retrieve.
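
Equation (3) translates almost directly into code. In the sketch below the function name and the default value of `alpha` are assumptions; the check also accepts α = 1, since the FIG. 9 discussion uses that value for the conventional case even though equation (3) itself states 0 < α < 1.

```python
def sim_domain(domain_p: str, domain_q: str, alpha: float = 0.5) -> float:
    """Domain similarity: 1/alpha for the same domain, 1 for different domains."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must satisfy 0 < alpha <= 1")
    return 1.0 / alpha if domain_p == domain_q else 1.0


same_site = sim_domain("fujitsu.co.jp", "fujitsu.co.jp", alpha=0.5)
other_site = sim_domain("fujitsu.co.jp", "fujitsu.com", alpha=0.5)
```

Because the link weight uses the reciprocal of the similarity, a same-domain link here counts for half as much as a cross-domain link, and lowering `alpha` widens that gap.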

【００８０】sim(p,q)として、前述の三種類の情報を融
合した類似度sim ＿merge(p,q)を次のように定義する。 sim ＿merge(p,q)＝（サーバアドレスの類似度）＋（パ
スの類似度）＋（ファイル名の類似度）以下、右辺の各項の算出方法について説明する。As sim(p,q), the merged similarity sim_merge(p,q), which combines the three kinds of information described above, is defined as follows: sim_merge(p,q) = (server address similarity) + (path similarity) + (file name similarity). The calculation of each term on the right-hand side is described below.

【００８１】サーバアドレスの類似度は、アドレスの階
層を後ろから見ていき、ｎレベルまで一致した場合、類
似度を１＋ｎとする。例えば、www.fujitsu.co.jp とww
w.flab.fujitsu.co.jpは３レベルまで一致しているので
４となる。www.fujitsu.co.jp とwww.fujitsu.com は１
レベルも一致していないので（一致０レベル）、類似度
は１である。The server address similarity is obtained by comparing the address hierarchy from the end: when the addresses match for n levels, the similarity is 1 + n. For example, www.fujitsu.co.jp and www.flab.fujitsu.co.jp match for three levels, so the similarity is 4. www.fujitsu.co.jp and www.fujitsu.com do not match at even one level (zero matching levels), so the similarity is 1.

【００８２】パスの類似度は、先頭からパスの"/" で区
切られた要素毎に比較し、一致したレベルまでを類似度
とする。例えば、/doc/patent/index.htmlと/doc/paten
t/1999/2/file.htmlとは、２レベルまで一致しているの
で類似度は２である。The path similarity is obtained by comparing the path elements delimited by "/" one by one from the beginning; the number of levels that match is the similarity. For example, /doc/patent/index.html and /doc/patent/1999/2/file.html match for two levels, so the similarity is 2.

【００８３】ファイル名の類似度は、ファイル名が一致
する場合、類似度１とする。上記は、以下のような考え
方に基づいている。１．往々にして、同じような文書を同一ディレクトリに
入れるため、同一サーバでパスも同じＵＲＬは内容が似
ていることが多い。２．アクセスを分散させるために設けられるミラーサイ
トは類似度が高い。これらは、サーバアドレス部分だけ
が異なり、残りのパスやファイル名は同じ場合が多い。３．サーバアドレス、パス、ファイル名が全てことなる
ＵＲＬは、類似度が低い。The file name similarity is 1 when the file names match. The above definitions are based on the following ideas. 1. Similar documents are often placed in the same directory, so URLs with the same server and the same path frequently have similar content. 2. Mirror sites set up to distribute access have high similarity; they often differ only in the server address, with the remaining path and file name being the same. 3. URLs whose server address, path, and file name all differ have low similarity.
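
The three terms of sim_merge can be sketched as below, using the standard library's `urllib.parse.urlsplit` to separate host and path. The function names are hypothetical, and two details are assumptions not stated in the text: the last path element is treated as the file name and excluded from the path comparison, and non-matching file names contribute 0.

```python
from typing import Sequence
from urllib.parse import urlsplit


def _matching_prefix(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading elements that are equal in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def server_similarity(host_p: str, host_q: str) -> int:
    """1 + number of host levels that match, compared from the end."""
    labels_p = host_p.lower().split(".")[::-1]
    labels_q = host_q.lower().split(".")[::-1]
    return 1 + _matching_prefix(labels_p, labels_q)


def path_similarity(dirs_p: Sequence[str], dirs_q: Sequence[str]) -> int:
    """Number of path levels that match, compared from the beginning."""
    return _matching_prefix(dirs_p, dirs_q)


def sim_merge(url_p: str, url_q: str) -> int:
    """Server address similarity + path similarity + file name similarity."""
    p, q = urlsplit(url_p), urlsplit(url_q)
    *dirs_p, file_p = p.path.lstrip("/").split("/")
    *dirs_q, file_q = q.path.lstrip("/").split("/")
    file_sim = 1 if file_p and file_p == file_q else 0  # assumed 0 when names differ
    return (server_similarity(p.hostname or "", q.hostname or "")
            + path_similarity(dirs_p, dirs_q) + file_sim)


merged = sim_merge("http://www.fujitsu.co.jp/doc/patent/index.html",
                   "http://www.flab.fujitsu.co.jp/doc/patent/1999/2/file.html")
```

For the two example URLs, the server term is 4, the path term is 2, and the file names differ, so `merged` is 6 under these assumptions.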

【００８４】This sim_merge(p, q) can also prevent pages with similar URLs from being retrieved together. Thus, introducing the concept of sim(p, q) or diff(p, q) into lw(p, q) solves the conventional problem that a server (site) or an individual holding a large number of pages attains a high link importance merely because of the quantity of pages.

【００８５】The link importance W_p described above is also used in the relevance calculation described below. The procedure by which the keyword-document relevance calculator 23 of the document retrieval device calculates the relevance is described next.

【００８６】First, creating an index from keywords to documents requires a measure of the relevance between a keyword and a document. Relevance is considered as follows.
1. A document that contains the keyword many times, that is, one with a high degree of keyword inclusion, has higher relevance.
2. A document of higher importance has higher relevance.
3. The number of documents related to a given keyword should be no more than a fixed number. (Obtaining as many as 1,000 related documents for a single keyword is undesirable.)

【００８７】In this embodiment, to keep the number of documents related to a keyword within a fixed number, the following ideas are introduced in addition to those above.
4. Relevance based on analysis of user access logs: a document that was accessed more often from the keyword within a fixed period has higher relevance.
5. Relevance based on the link importance of documents: among the documents containing the keyword, those with high link importance have higher relevance.

【００８８】Incorporating these ideas, the relevance of a page p to a keyword w is given by equation (4):
Rel(p, w) = TF(p, w) * log W_p * log(AC(p, w) + 2) ... (4)
Here, TF(p, w) is the number of occurrences of w in p, and W_p is the link importance of p, corresponding to W_p in equation (2) above.

【００８９】AC(p, w) is the number of times p was accessed from keyword w during a fixed period (for example, one month or one week). For each keyword, a fixed number of pages, taken in descending order of Rel(p, w), are treated as the pages related to that keyword.
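Equation (4) and the selection of a fixed number of related pages can be sketched as below. This is a minimal illustration under stated assumptions: the logarithm base is not given in the text, so the natural logarithm is assumed; W_p is assumed to be positive; and the names `rel`, `related_pages`, and `limit` are hypothetical.

```python
import math


def rel(tf, w_p, ac):
    """Rel(p, w) = TF(p, w) * log W_p * log(AC(p, w) + 2).

    tf:  occurrences of keyword w in page p
    w_p: link importance of page p (assumed > 0)
    ac:  accesses to p from keyword w in the observation period
    The natural logarithm is an assumption; the text does not name a base.
    """
    return tf * math.log(w_p) * math.log(ac + 2)


def related_pages(stats, limit):
    """Return at most `limit` page ids in descending order of Rel.

    stats maps a page id to its (tf, w_p, ac) triple for one keyword.
    """
    scores = {page: rel(*values) for page, values in stats.items()}
    return sorted(scores, key=scores.get, reverse=True)[:limit]
```

Adding 2 to AC keeps the access factor positive even for a page that has never been accessed from the keyword, so such a page is ranked lower but is not excluded outright.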

【００９０】In addition to the degree to which a page contains the keyword, the relevance calculation uses the link importance W_p, which incorporates the concept of URL similarity, and an analysis of users' access logs. Because more conditions must be met to raise a page's relevance, it becomes even more difficult for a third party to intentionally manipulate page content so as to raise the relevance of a particular page.

【００９１】The following describes the index generated by the index generator 24, that is, the interface for selecting a keyword for retrieval. The selection interface of this embodiment lets the user reach a keyword simply by clicking fragments of the keyword's reading one after another, and has the following features.
・Each screen displays fragments (characters or character strings) of keyword readings, together with some of the keywords that correspond to the reading selected so far.
・The user reaches a keyword by successively clicking reading fragments (characters or character strings) on the screen.
・The number of keywords displayed on one screen can be limited to a fixed number.

【００９２】In the prior art described above, a reading is always selected by clicking one character at a time, whereas in this embodiment the unit clicked is not necessarily a single character and may be a character string. This reduces wasted clicks before the keyword is reached. Moreover, because the number of keywords displayed on one screen is bounded, selecting a keyword is easy. The bound is also useful for keyword selection on mobile terminals such as mobile phones, where the display screen is small. To achieve this, the index generator 24 operates as follows.
1. Unify keyword readings (or spellings). In some cases, remove "ー" (the long-vowel mark) and unify small kana such as "ぃ" to full-size notation such as "あ" and "い".
2. From the keywords and their readings (or spellings), build a directed graph (character string graph) whose nodes are reading characters and whose leaves are keyword sets.
3. Perform the following contraction operations on the graph (see below).

【００９３】(a) Contraction of paths to leaves (b) Removal of intermediate paths (c) Folding the keywords of a child node into its parent node and deleting the child node
4. Create the reading-input interface based on the contracted graph.
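Step 1 of the list above (unifying readings) could look like the following sketch. The exact normalization rules are not specified in the text, so the set of small kana handled here, including the mapping of "っ" to "つ", is an assumption, and `normalize_reading` is a hypothetical name.

```python
# Assumed mapping from small kana to their full-size forms.
_SMALL_TO_FULL = str.maketrans("ぁぃぅぇぉゃゅょっゎ", "あいうえおやゆよつわ")


def normalize_reading(reading):
    """Drop the long-vowel mark and replace small kana with full-size kana."""
    return reading.replace("ー", "").translate(_SMALL_TO_FULL)
```

Under this sketch, the reading "あいぼりー" becomes "あいぼり", so that variant spellings of one reading end up on the same path of the character string graph.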

【００９４】The following describes how the index generator 24 carries out the steps from creation of the keyword character string graph onward. A keyword character string graph is a directed labeled graph that represents the readings of keywords. FIG. 10 shows an example of a keyword character string graph.

【００９５】A keyword character string graph is represented by the 6-tuple (N, C, KW, t, nk, yomi), where:
N is the set of nodes;
C is the set of kana characters;
KW is the set of keywords;
t: N × C⁺ → N is the node transition function, where C⁺ denotes a label, that is, a sequence of one or more kana characters (shown by solid arrows in character string graphs such as FIG. 9(a));
nk: N → W⁺ gives the keywords assigned to a node (shown by dotted lines in FIG. 9(a) and the like);
yomi: N → C⁺ gives the "reading" of a node.

【００９６】Taking FIG. 10(a) as an example, the graph is as follows (yomi is self-evident and is omitted):
N = {top, 「あ」, 「あい」, 「あいぼ」, 「あいぼう」, 「あいぼり」, 「あお」, 「あおぞ」, 「あおぞら」}
C = {'あ', ..., 'ん'}
KW = {'青', '蒼', '青空', 'アイボリー'}
t(top, あ) = 「あ」, t(あ, い) = 「あい」, t(あ, お) = 「あお」, t(あい, ぼ) = 「あいぼ」, t(あいぼ, う) = 「あいぼう」, t(あいぼ, り) = 「あいぼり」, t(あお, ぞ) = 「あおぞ」, t(あおぞ, ら) = 「あおぞら」
nk(あいぼう) = {'相棒'}
nk(あいぼり) = {'アイボリー'}
nk(あお) = {'青', '蒼'}
nk(あおぞら) = {'青空'}
Given keywords and their readings, the index generator 24 generates an initial keyword character string graph from them. FIG. 11 shows the procedure for generating the initial keyword character string graph, and the procedure in the index generator 24 is described below with reference to FIG. 11. FIG. 12 shows an example of an algorithm that implements this procedure.

【００９７】First, the set KW of keywords is created (step S11). Next, it is determined whether the created set KW is empty (step S12). If KW is empty (step S12, yes), no character string graph needs to be created, and the processing ends. If KW is not empty (step S12, no), the procedure advances to the next step.

【００９８】Next, a keyword u is taken out of the set KW (step S13). The reading yomi(u) of this keyword u and the node nk{yomi(u)} for that reading are set, and the node nk{yomi(u)} is added as a terminal node (step S14).

【００９９】It is determined whether the processing of step S14 has been repeated for the length of the character string of keyword u, that is, whether keyword u has become empty (step S15). If keyword u is empty (step S15, yes), the processing required for keyword u is complete, so the procedure returns to step S12, takes another keyword from the set KW, and repeats the steps from S13 onward. If keyword u is not empty (step S15, no), the last character of keyword u is cut off (step S16). The node is then moved to its immediate parent node, which is added (step S17). Attention then moves to the character preceding the removed last character of keyword u (step S18), and the procedure returns to step S15.

【０１００】This procedure yields nk, the list of keywords assigned to each node, and t, the list of child nodes located below each node. FIG. 10(a) shows the initial keyword character string graph created by the above procedure. It is generated from the following keyword and reading data:
蒼: あお, 青: あお, 青空: あおぞら, 相棒: あいぼう, アイボリー: あいぼり
Equivalently, FIG. 10(a) corresponds to running the init_kw_graph() algorithm shown in FIG. 12 with
@KW = {蒼, 青, 青空, 相棒, アイボリー}
yomi{蒼} = あお, yomi{青} = あお, yomi{青空} = あおぞら, yomi{相棒} = あいぼう, yomi{アイボリー} = あいぼり.
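The construction of the initial graph can be sketched as follows. This is not the algorithm of FIG. 12, which is not reproduced in the text; it is a minimal reading of steps S11 to S18 in which each node is identified by its reading prefix, the top node is the empty string, and the function name merely borrows init_kw_graph for convenience.

```python
from collections import defaultdict


def init_kw_graph(yomi):
    """Build (t, nk) from a mapping of keyword -> reading.

    t maps a node (a reading prefix; "" is the top node) to its child nodes.
    nk maps a node to the keywords assigned to it.
    """
    t = defaultdict(set)
    nk = defaultdict(list)
    for keyword, reading in yomi.items():
        nk[reading].append(keyword)      # terminal node holds the keyword
        node = reading
        while node:                      # cut the last character each round
            parent = node[:-1]
            t[parent].add(node)          # register node under its parent
            node = parent
    return t, nk
```

For the five keywords listed above, this yields the chain of nodes あ, あい, あいぼ, あいぼう/あいぼり and あ, あお, あおぞ, あおぞら, with 蒼 and 青 both assigned to the node あお.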

【０１０１】After creating the initial keyword character string graph, the index generator 24 next contracts the character strings. Contraction of character strings is described below. Contraction consists of two kinds of operations: 1. contraction of intermediate nodes, and 2. folding-in of terminal nodes.

【０１０２】First, the contraction of intermediate nodes in the index generator 24 is described. FIG. 13 shows the procedure for contracting intermediate nodes, which is described below with reference to FIG. 13. FIG. 14 shows an example of an algorithm that implements the contraction of intermediate nodes.

【０１０３】First, the set N of nodes is created (step S21). Next, it is determined whether N is empty (step S22). If N is empty (step S22, yes), no node needs to be contracted, and the processing ends. If N is not empty (step S22, no), a node n is taken from N (step S23). For the node n, it is determined whether both of the following conditions hold: n has exactly one successor node, and n has no keyword (step S24). If both conditions hold (step S24, yes), node n can be contracted, so it is deleted from the keyword character string graph and the procedure returns to step S22 (step S25). If the two conditions are not both satisfied (step S24, no), node n cannot be contracted, so the procedure returns to step S22 without deleting it.

【０１０４】As described above, an intermediate node of the keyword character string graph is contracted when it satisfies two conditions: no keyword is assigned to the node, and it has exactly one successor node (child node). In the initial keyword character string graph of FIG. 10(a), the nodes 「あい」 and 「あおぞ」 are the intermediate nodes that satisfy both conditions. FIG. 10(b) shows the result of contracting the intermediate nodes of the initial keyword character string graph of FIG. 10(a); the intermediate nodes 「あい」 and 「あおぞ」 have been deleted. Correspondingly, the proc_shrink_middle() algorithm shown in FIG. 14 gives:
t{''} = あ+
t{あ} = あいぼ+あお+
t{あいぼ} = あいぼう+あいぼり+
t{あお} = あおぞ+
nk{あいぼう} = 相棒+
nk{あいぼり} = アイボリ+
nk{あお} = 青+蒼+
nk{あおぞら} = 青空+
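The two-condition rule for intermediate nodes can be sketched as below. This is an illustrative sketch rather than the algorithm of FIG. 14: nodes are assumed to be identified by their reading prefix, the top node (the empty string) is assumed never to be removed, and `shrink_middle` is a hypothetical name. In this sketch the child of a removed node is re-attached to the removed node's parent, so the surviving child of あお is the node あおぞら.

```python
def shrink_middle(t, nk):
    """Remove nodes that have exactly one child and no keyword.

    t maps a node to the set of its child nodes ("" is the top node).
    nk maps a node to its keyword list. Returns a new child map.
    """
    t = {node: set(children) for node, children in t.items()}
    parent = {c: p for p, children in t.items() for c in children}
    for node in sorted(parent):              # every node except the top
        children = t.get(node, set())
        if len(children) == 1 and not nk.get(node):
            (child,) = children
            p = parent[node]
            t[p].discard(node)               # bypass the removed node
            t[p].add(child)
            parent[child] = p
            del t[node]
            del parent[node]
    return t
```

After contraction, the label on an edge is the difference between the child's reading and the parent's reading, which is why a single selection can cover more than one character.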

【０１０５】Next, the contraction of terminal nodes in the index generator 24 is described. FIG. 15 shows the procedure for contracting terminal nodes, which is described below with reference to FIG. 15. FIG. 16 shows an example of an algorithm that implements the contraction of terminal nodes.

【０１０６】First, the set N of all nodes is created (step S31). Next, the nodes are sorted by the number of keywords belonging to each node (step S32). An integer i is set to 1 (step S33). Next, it is determined whether i is smaller than the number of nodes in N (step S34). If i is not smaller than the number of nodes in N (step S34, no), it is further determined whether any terminal node was contracted (step S35). If no terminal node was contracted (step S35, no), the processing ends. If a terminal node was contracted (step S35, yes), the procedure returns to step S33.

【０１０７】If i is smaller than the number of nodes in N (step S34, yes), the i-th node n of N is obtained (step S36). It is determined whether the obtained node n is a terminal node (step S37). If n is a terminal node (step S37, yes), the procedure advances to step S38. If n is not a terminal node (step S37, no), n is not a node to be contracted, so i is incremented by one (step S41) and the procedure returns to step S34.

【０１０８】Next, the parent node p of node n is obtained (step S38). It is then determined whether the sum of the number of keywords belonging to the parent node p and the number of keywords belonging to the child node n exceeds a predetermined upper limit on the number of keywords (step S39).

【０１０９】If the sum of the number of keywords belonging to the parent node p and the number of keywords belonging to the child node n does not exceed the upper limit (step S39, yes), the child node n is deleted (contracted) and the keywords belonging to n become keywords belonging to the parent node p (step S40). Then i is incremented by one (step S41), and the procedure returns to step S34.

【０１１０】If the sum of the number of keywords belonging to the parent node p and the number of keywords belonging to the child node n exceeds the upper limit, contracting the child node n would give the parent node too many keywords, so n is not contracted and the procedure advances to step S41.

【０１１１】Moving the keyword information of a terminal node to its immediate parent node in this way reduces the depth of the tree (the chain of nodes), so the user can reach a keyword with fewer clicks. However, if too many child-node keywords are moved into a parent node, a single node ends up with a large number of keywords, and it becomes tedious for the user to pick the desired keyword from among them. To avoid this problem, a parameter words_max is given, and the number of keywords belonging to one node is kept below words_max.

【０１１２】FIG. 10(c) shows the result of applying terminal-node contraction to the keyword character string graph of FIG. 10(b) with the parameter words_max = 4. In FIG. 10(b), the terminal nodes 「あいぼう」 and 「あいぼり」 each have only one keyword. Their parent node 「あいぼ」 has two child nodes and no keyword. The total number of keywords of the parent node 「あいぼ」 and the child nodes 「あいぼう」 and 「あいぼり」 is therefore less than words_max = 4. The child nodes 「あいぼう」 and 「あいぼり」 can thus be contracted, so in FIG. 10(c) they are deleted and their keywords are moved to the parent node 「あいぼ」. Correspondingly, the proc_shrink_leaf() algorithm shown in FIG. 16 gives:
t{''} = あ+
t{あ} = あいぼ+あお+
t{あお} = あおぞ+
nk{あいぼ} = 相棒+アイボリー+
nk{あお} = 青+蒼+
nk{あおぞら} = 青空+

【０１１３】FIG. 17 shows a further example of a keyword character string graph subjected to terminal-node contraction. In FIG. 17, the parent node 「かいせ」 has three terminal nodes: 「かいせい」, 「かいせき」, and 「かいせつ」. The keywords belonging to the terminal nodes 「かいせい」 and 「かいせき」 are 「快晴」 and 「解析」, respectively, so these two terminal nodes can be contracted. The keywords belonging to the terminal node 「かいせつ」 are 「解説」 and 「開設」, so this terminal node can also be contracted. Two ways of contracting are therefore conceivable in FIG. 17. The former (contracting 「かいせい」 and 「かいせき」), however, reduces the total number of nodes further. In this embodiment, the terminal nodes are sorted by the number of keywords belonging to them, which makes an efficient contraction like the former possible.
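The terminal-node procedure (steps S31 to S41) and the effect of sorting can be sketched as follows. This is an illustrative sketch, not the algorithm of FIG. 16. Several points are assumptions: nodes are sorted with the fewest keywords first, a child is folded in when the combined keyword count does not exceed words_max, and the value words_max = 3 used in the usage note is hypothetical, since the text gives no limit for FIG. 17.

```python
def shrink_leaves(t, nk, words_max):
    """Fold terminal nodes into their parents while the limit allows.

    t maps a node to the set of its child nodes ("" is the top node).
    nk maps a node to its keyword list. Returns new (t, nk) copies.
    """
    t = {node: set(children) for node, children in t.items()}
    nk = {node: list(words) for node, words in nk.items()}
    parent = {c: p for p, children in t.items() for c in children}
    changed = True
    while changed:                            # repeat until nothing folds
        changed = False
        order = sorted(parent, key=lambda n: (len(nk.get(n, [])), n))
        for node in order:                    # fewest keywords first
            if t.get(node):                   # has children: not terminal
                continue
            p = parent[node]
            total = len(nk.get(p, [])) + len(nk.get(node, []))
            if total <= words_max:            # assumed reading of step S39
                nk.setdefault(p, []).extend(nk.pop(node, []))
                t[p].discard(node)
                t.pop(node, None)
                del parent[node]
                changed = True
    return t, nk
```

Applied to the FIG. 17 situation with the hypothetical limit words_max = 3, processing the single-keyword nodes first folds 「かいせい」 and 「かいせき」 into 「かいせ」, after which 「かいせつ」 no longer fits and remains a separate node. Processing 「かいせつ」 first would instead leave two child nodes in place.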

【０１１４】Next, examples of the index created by the index generator are described with reference to FIGS. 18 to 25. FIG. 18 shows the transitions of the index screens, from the index top screen through the index intermediate screens and the keyword information screen to a document page. The transitions of the index screens shown on the display device are described with reference to FIG. 18. First, the index top screen is displayed. When the user selects the first part of a keyword's reading or spelling on the index top screen, the display moves to an index intermediate screen. When the user selects the next part of the reading or spelling on the index intermediate screen, the display moves to a further index intermediate screen. Repeating this selection brings the user to the desired keyword; selecting that keyword moves the display to the keyword information screen. When the user selects another keyword, the display moves to a further keyword information screen. When the user selects the title of a document page to view, the link is followed and the display moves to that document page. The user's selection operations may be performed by clicking with a mouse or by pointing with a pen or the like. Each screen may be created in hypertext, for example.

【０１１５】FIG. 19 shows an example of the index top screen. The index top screen displays the characters (or character strings) directly below the top of the index information table 61 shown in FIG. 4. In FIG. 19, the Japanese syllabary, the alphabet, and the digits 0 to 9 are displayed. The user proceeds to the next screen by clicking the beginning of the reading or spelling of the keyword to be searched for.

【０１１６】FIG. 20 shows a further example of the index top screen. FIG. 20 shows that, as a result of unifying notation when assigning readings to keywords and/or contracting nodes, 「ぢ」 and 「づ」 in the 「だ」 row of the Japanese syllabary, for example, have been removed from the index. Similarly, the letters "Y" and "Z" of the alphabet have been removed.

【０１１７】FIG. 21 shows an example of an index intermediate screen. FIG. 21 is the index intermediate screen displayed when 「あ」 is selected on the index top screen. The upper half shows the character strings that continue after 「あ」, and the lower half shows the other keywords. The index intermediate screen can be generated from the index information table 61 of FIG. 4 and the keyword correspondence table 51 of FIG. 3 (which yields a keyword ID from a keyword).

【０１１８】In FIG. 21, the character strings that continue after 「あ」 include 「いぼ」, 「え」, and 「お」. For example, selecting 「いぼ」 moves to 「あいぼ」. As the other keywords, a fixed number (for example, up to 20) of keywords such as 「愛」 and 「愛犬」 are displayed. All keywords that begin with 「あ」 and for which no continuing reading can be selected are displayed here. Conversely, if a keyword is not displayed and has no continuing reading, the user can tell that the keyword is not included in the index.
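The contents of an index intermediate screen can be derived from a contracted graph as in the sketch below. It assumes, for illustration only, that nodes are identified by their reading prefix and that the graph is held as two dictionaries; the patent itself reads this information from the index information table 61, whose layout is not reproduced here. `index_screen` is a hypothetical name.

```python
def index_screen(t, nk, node):
    """Return (continuations, keywords) to show for the selected node.

    continuations: for each child, the part of its reading that follows
                   the reading of `node` (one or more characters).
    keywords:      the keywords assigned directly to `node`.
    """
    continuations = sorted(child[len(node):] for child in t.get(node, ()))
    keywords = list(nk.get(node, ()))
    return continuations, keywords
```

For a graph in which 「あ」 has the children 「あいぼ」 and 「あお」, the screen for 「あ」 offers the continuations 「いぼ」 and 「お」, so the two characters of 「いぼ」 are selected in a single step.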

【０１１９】FIG. 22 shows a further example of an index intermediate screen. FIG. 22 is the index intermediate screen displayed when 「い」 is selected on the index top screen. The upper half shows the character strings that continue after 「い」, and the lower half shows the other keywords. In FIG. 22, the character strings that continue after 「あ」 include 「いぼ」 and 「えろ」. As the other keywords, a fixed number (for example, up to 20) of keywords such as 「イオン」 and 「イネーブル」 are displayed.

【０１２０】FIG. 23 shows a further example of an index intermediate screen, namely the one displayed when 「いべんと」 is selected. Because the node 「イベント」 has no child nodes, no character strings continuing after 「いべんと」 are displayed; only keywords are displayed. The user selects the desired keyword from this screen. Any keyword containing 「いべんと」 that does not appear on this screen is not included in the index.

【０１２１】In the prior art, a reading can be selected only one character at a time, so entering a long character string requires the selection operation to be repeated many times. In this embodiment, because nodes are contracted, the continuing character string does not necessarily have to be selected one character at a time. For example, two characters can be selected at once, as with the character string 「いぼ」 in FIG. 21 or the character string 「えろ」 in FIG. 22. The number of selections made by the user can therefore be reduced.

【０１２２】Furthermore, all keywords for which no continuing reading can be selected are displayed on the screen. Conversely, if a keyword is not displayed and has no continuing reading, it is not included in the index. This avoids the problem in which the user selects one character at a time and learns only upon selecting the last character that the intended keyword is not in the index.

【０１２３】Furthermore, because terminal-node contraction ensures that no more keywords are displayed than the value set for the parameter words_max, the user can find the desired keyword among the displayed keywords with relative ease. Limiting the displayed keywords to a fixed number is also useful for keyword selection on mobile terminals such as mobile phones, where the size of the display screen is limited.

【０１２４】Moreover, as a search interface, this embodiment lets the user reach a keyword in a fixed keyword set with as little effort as possible, and it differs from conventional kana-kanji conversion in the following respects.
・No conversion-key operation is required.
・The user need not enter the entire reading of the keyword; it suffices to give the minimum information that identifies the keyword.

【０１２５】Thus, for example, if 「ナレッジマネージメント」 is the only word in the keyword set that begins with 「なれ」, then as soon as the user specifies 「な」 and 「れ」, 「ナレッジマネージメント」 is displayed.

【０１２６】FIG. 24 shows an example of the keyword information screen, namely the one displayed when the keyword 「圧縮」 is clicked on an intermediate screen. In FIG. 24, the representative word and its synonyms (「圧縮」, "Compress") are displayed first. These can be obtained from the information in the keyword table 51 and the keyword correspondence table 52 of FIG. 3. The 「あ」 at the upper right of the screen indicates the path followed so far. It allows the user to go back and helps prevent the user from getting lost in the hypertext. The screen also displays the title of each document, link information, and other keywords. Because at most a fixed number of documents (for example, 20) are displayed in descending order of priority, the user has no difficulty selecting a document from among them. The list of document IDs can be obtained from the related document table 62 of FIG. 4. Information on each document ID is held in the document information table 41 of FIG. 2. The other keywords are obtained from the related keyword table 63 of FIG. 4. When the user selects the document information to view, the link to that document is followed and the document is displayed on the screen.

【０１２７】図２５は、キーワード情報画面のさらなる
例を示す。中間画面でキーワード「イベントカレンダ
ー」をクリックした時のキーワード情報画面が示されて
いる。画面右上に「トップ」―「イ」―「イベント」と
あるのは、辿ってきたパスを示す。それぞれのパスを選
択することにより、利用者は先に見た画面に戻ることが
できる。FIG. 25 shows a further example of the keyword information screen, namely the screen displayed when the keyword "イベントカレンダー" (event calendar) is clicked on the intermediate screen. The "トップ" (top) - "イ" - "イベント" (event) at the upper right of the screen indicates the path that has been followed. By selecting any element of the path, the user can return to a previously viewed screen.

【０１２８】図２６は第２実施形態に係わるネットワー
ク文書検索装置の構成を示す。図２６において、図１に
示す第１実施形態の構成に加えて、収集装置８１及び同
義語辞書８２を更に備える。収集装置８１はイントラネ
ット (及び・又はインターネット）から大量の文書を収
集する、例えばＷebロボットである。同義語辞書（同義
語データ）８２は同キーワード対応テーブルの情報の一
部を格納する。なお、入力装置及び出力装置は、例え
ば、ＷＷＷブラウザ８３であってもよい。FIG. 26 shows the configuration of a network document search device according to the second embodiment. The configuration in FIG. 26 adds a collection device 81 and a synonym dictionary 82 to the configuration of the first embodiment shown in FIG. 1. The collection device 81 is, for example, a Web robot that collects a large number of documents from an intranet (and/or the Internet). The synonym dictionary (synonym data) 82 stores part of the information of the keyword correspondence table. The input device and the output device may be, for example, a WWW browser 83.

【０１２９】収集装置８１はネットワークから文書を自
動で収集し、処理装置１１のキーワード抽出器２２は同
義語辞書８２を用いて、収集されたページからキーワー
ドを抽出し、ページ内のキーワード出現頻度を集計す
る。つまり、同義語辞書８２を用いて重要度の高い文書
を自動選別する。これにより、イントラネット（及び・
又はインターネット）の大量の文書を自動選別すること
を可能にする。The collection device 81 automatically collects documents from the network. The keyword extractor 22 of the processing device 11 uses the synonym dictionary 82 to extract keywords from the collected pages and tallies the frequency with which each keyword appears in each page. In other words, highly important documents are selected automatically using the synonym dictionary 82. This makes it possible to automatically sort a large number of documents on an intranet (and/or the Internet).
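As an illustration of tallying keyword frequencies with a synonym dictionary, the following is a rough sketch. It assumes the dictionary maps each surface form to a representative word and that matching is a naive, case-insensitive substring count. The function name `count_keywords` and the sample entries are hypothetical, and the specification does not state how the keyword extractor 22 actually matches text.

```python
from collections import Counter


def count_keywords(page_text: str, synonyms: dict[str, str]) -> Counter:
    """Count keyword occurrences in a page, folding synonyms into one entry."""
    counts: Counter = Counter()
    lowered = page_text.lower()
    for surface, representative in synonyms.items():
        # Naive substring matching; a real extractor would tokenize the text.
        counts[representative] += lowered.count(surface.lower())
    return +counts  # unary plus drops entries whose count is zero


# Hypothetical synonym dictionary: surface form -> representative word.
SYNONYMS = {"圧縮": "圧縮", "compress": "圧縮", "linux": "Linux"}
PAGE = "Compress the file. 圧縮 is fast. compress again."
RESULT = count_keywords(PAGE, SYNONYMS)
```

Folding synonyms into the representative word means a page that mixes "圧縮" and "Compress" is counted as one keyword rather than two weaker ones.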

【０１３０】図２７は第３実施形態に係わる特定タイプ
文書のネットワーク文書検索装置の構成を示す。図２７
において、図２６に示す第２実施形態の構成に加えて、
処理装置１１内に文書タイプ判別器９１を備える。文書
タイプ判別器９１はイントラネット（及び・又はインタ
ーネット）から収集された文書データのリンク関係及び
ＵＲＬに基づいて、その文書データの文書タイプを判別
する。より具体的には、文書タイプ判別器９１はリンク
重要度付与器２１内で算出されたＵＲＬ類似度とリンク
重要度付与器２１により抽出されたリンク関係が示すリ
ンク／被リンク数に基づいて、（その内容を理解せず
に）文書データのコンテンツのタイプを判別する。文書
タイプ判別器９１における文書タイプの判別は以下のよ
うに行う。１．ＵＲＬ類似度が一定以下（低い）の文書データへの
リンクを一定数以上持つページはリンク集である。２．ＵＲＬ類似度が一定以上（高い）の文書データへの
リンクを一定数以上持つページはメニュー（エントリ）
ページである。３．ＵＲＬ類似度が一定以上（高い）の文書データから
一定数以上参照されているページは、メニュー（エント
リ）ページである。４．それ以外で、ＵＲＬ類似度が一定以上（高い）の文
書データへのリンクが一定数以下のページはコンテンツ
ページである。FIG. 27 shows the configuration of a network document search device for documents of a specific type according to the third embodiment. The configuration in FIG. 27 adds a document type discriminator 91 in the processing device 11 to the configuration of the second embodiment shown in FIG. 26. The document type discriminator 91 determines the document type of document data collected from an intranet (and/or the Internet) based on the link relationships and the URL of that document data. More specifically, the document type discriminator 91 determines the content type of the document data (without understanding its contents) based on the URL similarity calculated in the link importance assigning unit 21 and on the numbers of outgoing and incoming links indicated by the link relationships extracted by the link importance assigning unit 21. The document type discriminator 91 determines the document type as follows.
1. A page that has at least a certain number of links to document data whose URL similarity is at or below a certain value (low) is a link collection.
2. A page that has at least a certain number of links to document data whose URL similarity is at or above a certain value (high) is a menu (entry) page.
3. A page that is referenced by at least a certain number of pieces of document data whose URL similarity is at or above a certain value (high) is a menu (entry) page.
4. Any other page that has at most a certain number of links to document data whose URL similarity is at or above a certain value (high) is a content page.
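The four rules above can be sketched as a small classifier. This is only an illustration under stated assumptions: `url_similarity` is a hypothetical stand-in that returns 1.0 for the same host and 0.0 otherwise (the similarity actually computed by the link importance assigning unit 21 is defined elsewhere in the specification), and the thresholds `sim_threshold` and `min_links` are arbitrary example values.

```python
from urllib.parse import urlparse


def url_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in: 1.0 when both URLs share a host, else 0.0."""
    return 1.0 if urlparse(a).netloc == urlparse(b).netloc else 0.0


def classify_page(url, out_links, in_links, sim_threshold=0.5, min_links=3):
    """Classify a page as 'link collection', 'menu', or 'content'."""
    high_out = sum(1 for u in out_links if url_similarity(url, u) >= sim_threshold)
    low_out = len(out_links) - high_out
    high_in = sum(1 for u in in_links if url_similarity(url, u) >= sim_threshold)
    if low_out >= min_links:      # rule 1: many links to dissimilar URLs
        return "link collection"
    if high_out >= min_links:     # rule 2: many links to similar URLs
        return "menu"
    if high_in >= min_links:      # rule 3: referenced by many similar URLs
        return "menu"
    return "content"              # rule 4: few links to similar URLs
```

Because the same threshold is reused for every rule in this sketch, rule 4 acts as a catch-all; an implementation with separate thresholds per rule could leave some pages unclassified.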

【０１３１】このように判別することにより、文書タイ
プ判別器９１は、かなりの確率で文書データ（ＷＷＷペ
ージ）の文書タイプ（例えば、メニューページ、リンク
集、コンテンツページ）を分けることができる。With these rules, the document type discriminator 91 can classify document data (WWW pages) by document type (for example, menu page, link collection, content page) with fairly high probability.

【０１３２】文書タイプ判別器９１は、文書タイプを判
別し、判別された文書タイプ９２をキーワード文書関連
度計算器２３に出力する。キーワード文書関連度計算器
２３は、判別された文書タイプ９２に基づいて特定タイ
プの文書データを選択し、選択された文書データについ
てリンク重要度、ページキーワード及びアクセスログに
基づいて文書関連度を計算する。例えば、キーワード文
書関連度計算器２３は文書タイプがコンテンツページで
あると判別された文書データを選択して、これらのコン
テンツページについての関連度を計算するようにしても
よい。The document type discriminator 91 determines the document type and outputs the determined document type 92 to the keyword document relevance calculator 23. The keyword document relevance calculator 23 selects document data of a specific type based on the determined document type 92 and calculates the document relevance of the selected document data based on the link importance, the page keywords, and the access log. For example, the keyword document relevance calculator 23 may select the document data whose document type was determined to be a content page and calculate the relevance for those content pages.

【０１３３】このように、図２７のネットワーク文書検
索装置は、索引に掲載するページをここで得た文書タイ
プ判別器の判別したタイプのいずれかに限定すること
で、より質の高い文書整理を行うことが可能となる。In this way, the network document search device of FIG. 27 can organize documents with higher quality by restricting the pages listed in the index to one of the types determined by the document type discriminator.

【０１３４】図２８は第４実施形態に係わるリンク集生
成システムの構成を示す。図２８において、リンク集生
成システムは、収集装置１０１、処理装置１０２及び入
出力装置１０７を備える。収集装置１０１はインターネ
ット（又は・及びイントラネット）から大量の文書デー
タを収集する、例えばＷebロボットである。処理装置１
０２は、リンク重要度付与器２１、ＵＲＬ文字列判別器
１０３、索引生成器２４及びＷＷＷサーバ１０６を備え
る。リンク重要度付与器２１はＵＲＬの類似度とリンク
関係を元に、文書のリンク重要度を算出し、算出された
リンク重要度３１を索引生成器２４へ出力する。FIG. 28 shows the configuration of a link collection generation system according to the fourth embodiment. In FIG. 28, the link collection generation system includes a collection device 101, a processing device 102, and an input/output device 107. The collection device 101 is, for example, a Web robot that collects a large amount of document data from the Internet (and/or an intranet). The processing device 102 includes a link importance assigning unit 21, a URL character string discriminator 103, an index generator 24, and a WWW server 106. The link importance assigning unit 21 calculates the link importance of each document based on the URL similarity and the link relationships, and outputs the calculated link importance 31 to the index generator 24.

【０１３５】ＵＲＬ文字列判別器１０３はＵＲＬの文字
列上の特徴に基づいて、収集された文書データのコンテ
ンツを（その内容を理解せずに）判別する。ＵＲＬ文字
列判別器１０３は、ＵＲＬの文字列上の特徴から例え
ば、以下のように文書データのコンテンツを判別する。１．ＵＲＬの文字列にY2K, y2k, year2000を含むような
文書データは2000年問題の関連ページである。２．ＵＲＬの文字列にnews, release, pressを含み、そ
の後で数字列（往々にして日時の情報）を持つ文書デー
タはニュース（プレス）リリースのページである。３．ＵＲＬの文字列にjava, JAVAを含むページはJava関
連ページである。４．ＵＲＬの文字列にdownload, dwnload, dwnldを含む
ページは、ソフトウェアダウンロード関係のページであ
る。５．ＵＲＬの文字列にLINUX, linux, Linux を含むペー
ジは、Linux 関連ページである。The URL character string discriminator 103 determines the content of the collected document data (without understanding that content) based on features of the URL character string. For example, the URL character string discriminator 103 determines the content of the document data from features of the URL character string as follows.
1. Document data whose URL character string contains Y2K, y2k, or year2000 is a page related to the year 2000 problem.
2. Document data whose URL character string contains news, release, or press followed by a numeric string (often date and time information) is a news (press) release page.
3. A page whose URL character string contains java or JAVA is a Java-related page.
4. A page whose URL character string contains download, dwnload, or dwnld is a software download page.
5. A page whose URL character string contains LINUX, linux, or Linux is a Linux-related page.
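The five example rules can be expressed as regular expressions over the URL string. The sketch below is one possible encoding, not the discriminator's actual implementation; the labels, the `RULES` table, and the function name `classify_url` are assumptions introduced for illustration.

```python
import re

# One (label, pattern) pair per example rule in the paragraph above.
RULES = [
    ("year 2000 problem", re.compile(r"Y2K|y2k|year2000")),
    # news, release, or press followed somewhere later by a digit string
    ("news release", re.compile(r"(?:news|release|press).*\d")),
    ("Java", re.compile(r"java|JAVA")),
    ("software download", re.compile(r"download|dwnload|dwnld")),
    ("Linux", re.compile(r"LINUX|linux|Linux")),
]


def classify_url(url: str) -> list[str]:
    """Return every label whose pattern occurs in the URL string."""
    return [label for label, pattern in RULES if pattern.search(url)]
```

Matching on the raw string is deliberately crude: a URL containing "javascript", for instance, would also be labeled as Java-related, which is the kind of error the phrase "without understanding that content" implies.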

【０１３６】このように判別することにより、ＵＲＬ文
字列判別器１０３は特定のＵＲＬを判別し、判別された
特定ＵＲＬ集合１０４を索引生成器２４に出力する。索
引生成器２４は、リンク重要度３１に基づいて、特定Ｕ
ＲＬ集合１０４をリンク重要度の大きい順に並べて、上
位から一定個数を取り出し、上位ＵＲＬからリンク集を
作成し、リンク集１０５としてＷＷＷサーバ１０６に出
力する。なお、文字列判別器１０３で得られたＵＲＬが
少数の場合には、リンク関係を調べてそれらのリンク元
がよく参照（リンク）する他のページを加えて数を増や
してもよい。これは、しばしばＷＷＷで似たようなペー
ジは、似たようなリンク集から同時に参照されるためで
ある。ＷＷＷサーバ１０６はリンク集を利用者に提供
し、利用者はＷＷＷブラウザ１０７を介してリンク集を
表示させ、入力指示を行う。Through this determination, the URL character string discriminator 103 identifies specific URLs and outputs the identified specific URL set 104 to the index generator 24. Based on the link importance 31, the index generator 24 sorts the specific URL set 104 in descending order of link importance, takes a fixed number of URLs from the top, creates a link collection from these top-ranked URLs, and outputs it to the WWW server 106 as the link collection 105. If the URL character string discriminator 103 yields only a small number of URLs, the number may be increased by examining the link relationships and adding other pages that the link sources of those URLs frequently reference (link to). This is because similar pages on the WWW are often referenced together from similar link collections. The WWW server 106 provides the link collection to the user, and the user displays the link collection via the WWW browser 107 and enters instructions.
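The ranking step and the optional expansion step can be sketched as follows. The data layout is an assumption: `importance` maps a URL to its link importance and `links` maps each source page to the set of pages it links to. The function names and the `min_sources` cutoff, which stands in for "frequently reference", are hypothetical.

```python
from collections import Counter


def top_links(urls, importance, limit):
    """Sort URLs by link importance, highest first, and keep the top `limit`."""
    return sorted(urls, key=lambda u: importance.get(u, 0.0), reverse=True)[:limit]


def expand_by_cocitation(seeds, links, min_sources=2):
    """Add pages that sources linking to a seed URL also link to often."""
    seeds = set(seeds)
    counts = Counter()
    for source, targets in links.items():
        if seeds & targets:                  # this source links to a seed
            counts.update(targets - seeds)   # count its other link targets
    return seeds | {u for u, n in counts.items() if n >= min_sources}


# Hypothetical data.
IMPORTANCE = {"a": 0.9, "b": 0.2, "c": 0.5}
LINKS = {"hub1": {"a", "x", "y"}, "hub2": {"a", "x"}, "hub3": {"z"}}
```

Here "x" is added to the seed set {"a"} because two pages that link to "a" also link to "x", while "y" (one such page) and "z" (none) are not.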

【０１３７】このようにして、ＵＲＬ文字列に基づい
て、文書ページのコンテンツをみないでコンテンツを判
別し、この判別結果を用いてリンク集を作成する。これ
により、コンテンツの内容にそった質の高いリンク集を
簡便に作成することが可能となる。In this way, the content of a document page is determined from its URL character string without looking at the page content, and a link collection is created using the result of this determination. This makes it possible to easily create a high-quality link collection that matches the content.

【０１３８】図２９は第５実施形態に係わるリンク集生
成システムの構成を示す。図２９のリンク集生成システ
ムは、図２８のリンク集生成システムに文書タイプ判別
器１１１を更に備える構成となっている。文書タイプ判
別器１１１の機能及び動作は、図２７を用いて説明した
第３実施形態に係わる文書検索装置の文書タイプ判別器
９１と同様である。FIG. 29 shows the configuration of a link collection generation system according to the fifth embodiment. The link collection generation system of FIG. 29 is configured to further include a document type discriminator 111 in the link collection generation system of FIG. The function and operation of the document type discriminator 111 are the same as those of the document type discriminator 91 of the document search device according to the third embodiment described with reference to FIG.

【０１３９】収集装置１０１はインターネット（又は・
及びイントラネット）から大量の文書デ
ータを収集し、リンク重要度付与器２１はＵＲＬの類似
度とリンク関係を元に文書のリンク重要度を算出し、リ
ンク重要度３１を索引生成器２４へ出力する。ＵＲＬ文
字列判別器１０３はＵＲＬの文字列上の特徴に基づい
て、特定のＵＲＬを判別し、判別された特定ＵＲＬ集合
１０４を索引生成器２４へ出力する。文書タイプ判別器
１１１は、ＵＲＬ類似度とリンク／被リンク数に基づい
て、（コンテンツを見ずに）文書データの文書タイプを
判別し、判別された文書タイプ１１２を索引生成器２４
に出力する。The collection device 101 collects a large amount of document data from the Internet (and/or an intranet), and the link importance assigning unit 21 calculates the link importance of each document based on the URL similarity and the link relationships and outputs the link importance 31 to the index generator 24. The URL character string discriminator 103 identifies specific URLs based on features of the URL character string and outputs the identified specific URL set 104 to the index generator 24. The document type discriminator 111 determines the document type of the document data (without looking at the content) based on the URL similarity and the numbers of outgoing and incoming links, and outputs the determined document type 112 to the index generator 24.

【０１４０】索引生成器２４は、文書タイプ１１２に基
づいて特定ＵＲＬ集合１０４の中から特定の文書タイプ
の文書データを選び出す。続いて、選ばれた文書データ
のリンク重要度３１に基づいて、選ばれた文書データを
リンク重要度の大きい順に並べて、上位から一定個数
を取り出し、上位ＵＲＬからリンク集を作成し、リンク
集１０５としてＷＷＷサーバ１０６に出力する。ＷＷＷ
サーバ１０６はリンク集を利用者に提供し、利用者はＷ
ＷＷブラウザ１０７を介してリンク集を表示させ、入力
指示を行う。The index generator 24 selects document data of a specific document type from the specific URL set 104 based on the document type 112. Then, based on the link importance 31 of the selected document data, it sorts the selected document data in descending order of link importance, takes a fixed number from the top, creates a link collection from these top-ranked URLs, and outputs it to the WWW server 106 as the link collection 105. The WWW server 106 provides the link collection to the user, and the user displays the link collection via the WWW browser 107 and enters instructions.
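The selection performed by the index generator 24 in this embodiment amounts to a filter followed by a sort. The sketch below assumes dictionary inputs keyed by URL; the function name `build_link_collection` and the parameters `wanted_type` and `limit` are hypothetical.

```python
def build_link_collection(urls, doc_types, importance,
                          wanted_type="content", limit=10):
    """Keep URLs of one document type, then rank them by link importance."""
    selected = [u for u in urls if doc_types.get(u) == wanted_type]
    selected.sort(key=lambda u: importance.get(u, 0.0), reverse=True)
    return selected[:limit]


# Hypothetical inputs: the specific URL set, document types, link importance.
URLS = ["a", "b", "c", "d"]
DOC_TYPES = {"a": "content", "b": "menu", "c": "content", "d": "content"}
IMPORTANCE = {"a": 0.1, "b": 0.9, "c": 0.7, "d": 0.4}
```

Filtering before ranking means a highly ranked menu page such as "b" never displaces a content page from the fixed number of slots.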

【０１４１】これにより、コンテンツの内容にそった質
の高いリンク集を簡便に作成することが可能となる。と
ころで、図１、２６及び２７の文書検索装置並びに図２
８及び２９のリンク集生成システムは、図３０に示すよ
うな情報処理装置（コンピュータ）を用いて構成するこ
とができる。図３０の情報処理装置は、ＣＰＵ１２１、
メモリ１２２、入力装置１２３、出力装置１２４、外部
記憶装置１２５、媒体駆動装置１２６、及びネットワー
ク接続装置１２７を備え、それらはバス１２８により互
いに接続されている。This makes it possible to easily create a high-quality link collection that matches the content. The document search devices of FIGS. 1, 26, and 27 and the link collection generation systems of FIGS. 28 and 29 can be configured using an information processing device (computer) such as that shown in FIG. 30. The information processing device of FIG. 30 includes a CPU 121, a memory 122, an input device 123, an output device 124, an external storage device 125, a medium drive device 126, and a network connection device 127, which are connected to one another by a bus 128.

【０１４２】メモリ１２２は、例えば、ＲＯＭ（Ｒｅａ
ｄＯｎｌｙＭｅｍｏｒｙ）、ＲＡＭ（Ｒａｎｄｏｍ
ＡｃｃｅｓｓＭｅｍｏｒｙ）等を含み、処理に用い
られるプログラムとデータを格納する。ＣＰＵ１２１
は、メモリ１２２を利用してプログラムを実行すること
により、必要な処理を行う。The memory 122 includes, for example, a ROM (Read Only Memory) and a RAM (Random Access Memory), and stores the programs and data used for processing. The CPU 121 performs the necessary processing by executing the programs using the memory 122.

【０１４３】図１、２６及び２７の文書検索装置並びに
図２８及び２９のリンク集生成システムの各処理装置を
構成する各機器及び各部は、それぞれメモリ１２２の特
定のプログラムコードセグメントにプログラムとして格
納される。The devices and units that constitute the processing devices of the document search devices of FIGS. 1, 26, and 27 and of the link collection generation systems of FIGS. 28 and 29 are each stored as programs in specific program code segments of the memory 122.

【０１４４】入力装置１２３は、例えば、キーボード、
ポインティングデバイス、タッチパネル等であり、ユー
ザからの指示や情報の入力に用いられる。出力装置１２
４は、例えば、ディスプレイやプリンタ等であり、利用
者への問い合わせ、処理結果等の出力に用いられる。The input device 123 is, for example, a keyboard, a pointing device, or a touch panel, and is used to input instructions and information from the user. The output device 124 is, for example, a display or a printer, and is used to output inquiries to the user, processing results, and the like.

【０１４５】外部記憶装置１２５は、例えば、磁気ディ
スク装置、光ディスク装置、光磁気ディスク装置等であ
る。この外部記憶装置１２５に上述のプログラムとデー
タを保存しておき、必要に応じて、それらをメモリ１２
２にロードして使用することもできる。The external storage device 125 is, for example, a magnetic disk device, an optical disk device, or a magneto-optical disk device. The above-described programs and data can be stored in the external storage device 125 and loaded into the memory 122 for use as needed.

【０１４６】媒体駆動装置１２６は、可搬記録媒体１２
９を駆動し、その記録内容にアクセスする。可搬記録媒
体１２９としては、メモリカード、フロッピー（登録商
標）ディスク、ＣＤ−ＲＯＭ（ＣｏｍｐａｃｔＤｉｓ
ｋＲｅａｄＯｎｌｙＭｅｍｏｒｙ）、光ディス
ク、光磁気ディスク等、任意のコンピュータ読み取り可
能な記録媒体が用いられる。この可搬記録媒体１２９に
上述のプログラムとデータを格納しておき、必要に応じ
て、それらをメモリ１２２にロードして使用することも
できる。The medium drive device 126 drives a portable recording medium 129 and accesses its recorded contents. Any computer-readable recording medium can be used as the portable recording medium 129, such as a memory card, a floppy (registered trademark) disk, a CD-ROM (Compact Disk Read Only Memory), an optical disk, or a magneto-optical disk. The above-described programs and data can be stored on the portable recording medium 129 and loaded into the memory 122 for use as needed.

【０１４７】ネットワーク接続装置１２７は、ＬＡＮ
（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）、ＷＡＮ
（ＷｉｄｅＡｒｅａＮｅｔｗｏｒｋ）等の任意のネ
ットワーク（回線）を介して外部の装置と通信し、通信
に伴なうデータ変換を行う。また、必要に応じて、上述
のプログラムとデータを外部の装置から受け取り、それ
らをメモリ１２２にロードして使用することもできる。The network connection device 127 communicates with external devices via any network (line) such as a LAN (Local Area Network) or a WAN (Wide Area Network), and performs the data conversion that accompanies communication. The above-described programs and data can also be received from an external device and loaded into the memory 122 for use as needed.

【０１４８】図３１は、図３０の情報処理装置にプログ
ラムとデータを供給することのできるコンピュータ読み
取り可能な記録媒体を示している。可搬記録媒体１２９
や外部のデータベース１３０に保存されたプログラムとデ
ータは、メモリ１２２にロードされる。そして、ＣＰＵ
１２１は、そのデータを用いてそのプログラムを実行
し、必要な処理を行う。FIG. 31 shows computer-readable recording media that can supply programs and data to the information processing device of FIG. 30. Programs and data stored on the portable recording medium 129 or in an external database 130 are loaded into the memory 122. The CPU 121 then executes the programs using the data and performs the necessary processing.

【０１４９】[0149]

【発明の効果】本発明によれば、ＷＷＷページの重要度
を計算する時に、リンク関係だけでなく、ＵＲＬの類似
度を考慮に入れることで、多くのページを抱える特定サ
イトやそのミラーサイトの重要度が過大に評価されにく
くして、重要度を計算することが可能となる。これによ
り、より精度高く重要文書を選定することが可能とな
る。According to the present invention, taking into account not only the link relationships but also the URL similarity when calculating the importance of a WWW page makes it possible to calculate the importance in a way that makes the importance of a specific site that holds many pages, or of its mirror sites, less likely to be overestimated. This makes it possible to select important documents more accurately.

【０１５０】また、本発明により計算される重要度には
悪意を持った個人が意図的に操作しにくい性質がある。
また、本発明によれば、索引として、キーワードの読み
（又はスペル）の１文字以上の断片を次々クリックする
だけで、効率よくキーワード又は文書データにアクセス
することが可能となる。Further, the importance calculated according to the present invention has the property of being difficult for a malicious individual to manipulate intentionally. Further, according to the present invention, a keyword or document data can be accessed efficiently through the index simply by clicking, one after another, fragments of one or more characters of the reading (or spelling) of a keyword.

【０１５１】また、キーワード索引において、１画面上
のキーワードや文書数を一定個数内に収めることがで
き、利用者はキーワードを選択しやすいだけでなく、携
帯端末のような表示領域が限られたメディアでの利用に
も有効である。Also, in the keyword index, the number of keywords and documents on one screen can be kept within a fixed number, so that not only is it easy for the user to select a keyword, but the index is also effective for use on media with a limited display area, such as a portable terminal.

【０１５２】また、本発明によれば、文書データ中のキ
ーワード出現頻度と上述の文書データの重要度に基づい
て文書関連度を計算し、この関連度の順に文書データへ
アクセスするリンクを並べることにより、特定のキーワ
ードに関連する良質の文書への迅速なアクセスを可能と
するリンク集を作成することが可能となる。Further, according to the present invention, the document relevance is calculated based on the keyword appearance frequency in the document data and the importance of the document data, and the links for accessing the document data are arranged in the order of the relevance. As a result, it is possible to create a link collection that enables quick access to high-quality documents related to a specific keyword.

【０１５３】本発明によれば、ＵＲＬの類似と参照／被
参照数を利用して、文書のタイプ（メニュー、リンク
集、コンテンツ）を判別することが可能となる。さら
に、文書タイプの判別結果に基づいて文書データを選定
し、選定された文書データについてリンク重要度の算出
結果及び・又はキーワード出現頻度を組み合わせて、よ
り質の高い文書へのアクセスを可能とするリンク集を作
成することが可能となる。According to the present invention, the type of a document (menu, link collection, content) can be determined using the URL similarity and the numbers of outgoing and incoming references. Further, by selecting document data based on the result of the document type determination and combining, for the selected document data, the calculated link importance and/or the keyword appearance frequency, it is possible to create a link collection that provides access to higher-quality documents.

【０１５４】また、本発明によれば、特定のＵＲＬを判
別することにより、高い精度で特定の分野の文書データ
を自動的に選定することが可能となる。さらに上述のリ
ンク重要度と判別された特定のＵＲＬに基づいて、特定
分野の文書データにアクセスすることを可能とするリン
ク集が精度良く簡単に作成することが可能となる。Further, according to the present invention, identifying specific URLs makes it possible to automatically select document data in a specific field with high accuracy. Further, based on the above-described link importance and the identified specific URLs, a link collection that provides access to document data in a specific field can be created accurately and easily.

【０１５５】またさらに、上述のようにして判別した文
書タイプを元に、特定ＵＲＬの文書データの中から特定
の文書タイプの文書データを選択し、これら選択された
文書データについての上述のリンク重要度に基づいてリン
ク集を作成することにより、より質の高い特定分野の文
書データにアクセスすることを可能とするリンク集を作
成することが可能となる。Furthermore, by selecting document data of a specific document type from the document data of the specific URLs based on the document type determined as described above, and creating a link collection based on the above-described link importance of the selected document data, it is possible to create a link collection that provides access to higher-quality document data in a specific field.

[Brief description of the drawings]

【図１】第１実施形態に係わる文書検索装置の構成図で
ある。FIG. 1 is a configuration diagram of a document search device according to a first embodiment.

【図２】文書情報を格納するテーブル集を示す図であ
る。FIG. 2 is a diagram showing a collection of tables for storing document information.

【図３】キーワード情報を格納するテーブル集を示す図
である。FIG. 3 is a diagram showing a collection of tables for storing keyword information.

【図４】索引情報を格納するテーブル集を示す図であ
る。FIG. 4 is a diagram showing a table collection for storing index information.

【図５】アクセスログを示す図である。FIG. 5 is a diagram showing an access log.

【図６】索引生成処理の手順を示すフローチャートであ
る。FIG. 6 is a flowchart illustrating a procedure of an index generation process.

【図７】リンク重要度付与器において行う計算を模式的
に示す図（その１）である。FIG. 7 is a diagram (part 1) schematically illustrating a calculation performed by the link importance level assigning device.

【図８】リンク重要度付与器において行う計算を模式的
に示す図（その２）である。FIG. 8 is a diagram (part 2) schematically illustrating the calculation performed by the link importance assigning device.

【図９】リンク重要度を算出する際にＵＲＬ類似度の概
念を導入した結果を示す図である。FIG. 9 is a diagram showing the result of introducing the concept of URL similarity when calculating link importance.

【図１０】キーワード文字列グラフの一例を示す図であ
る。FIG. 10 is a diagram illustrating an example of a keyword character string graph.

【図１１】初期キーワード文字列グラフの生成手順を示
すフローチャートである。FIG. 11 is a flowchart illustrating a procedure for generating an initial keyword character string graph.

【図１２】初期キーワード文字列グラフの生成手順を実
現するアルゴリズムの一例を示す図である。FIG. 12 is a diagram showing an example of an algorithm for realizing a procedure for generating an initial keyword character string graph.

【図１３】中間ノードの縮退処理の手順を示すフローチ
ャートである。FIG. 13 is a flowchart illustrating a procedure of an intermediate node degeneration process.

【図１４】中間ノードの縮退処理の手順を実現するアル
ゴリズムの一例を示す図である。FIG. 14 is a diagram illustrating an example of an algorithm for realizing a procedure of an intermediate node degeneration process.

【図１５】末端ノードの縮退処理の手順を示すフローチ
ャートである。FIG. 15 is a flowchart illustrating a procedure of a terminal node degeneration process.

【図１６】末端ノードの縮退処理の手順を実現するアル
ゴリズムの一例を示す図である。FIG. 16 is a diagram illustrating an example of an algorithm for implementing a procedure of a degeneracy process of a terminal node.

【図１７】末端ノードの縮退処理をしたキーワード文字
列グラフの一例を示す図である。FIG. 17 is a diagram illustrating an example of a keyword character string graph in which a terminal node has been degenerated.

【図１８】索引画面の遷移を示す図である。FIG. 18 is a diagram showing transition of an index screen.

【図１９】索引トップ画面の一例を示す図である。FIG. 19 is a diagram illustrating an example of an index top screen.

【図２０】索引トップ画面のさらなる例を示す図であ
る。FIG. 20 is a diagram showing a further example of an index top screen.

【図２１】索引中間画面の一例を示す図である。FIG. 21 is a diagram showing an example of an index intermediate screen.

【図２２】索引中間画面のさらなる例を示す図である。FIG. 22 is a diagram showing a further example of an index intermediate screen.

【図２３】索引中間画面のさらなる例を示す図である。FIG. 23 is a diagram showing a further example of an index intermediate screen.

【図２４】キーワード情報画面の一例を示す図である。FIG. 24 is a diagram showing an example of a keyword information screen.

【図２５】キーワード情報画面のさらなる例を示す図で
ある。FIG. 25 is a diagram showing a further example of a keyword information screen.

【図２６】第２実施形態に係わる文書検索装置の構成図
である。FIG. 26 is a configuration diagram of a document search device according to a second embodiment.

【図２７】第３実施形態に係わる文書検索装置の構成図
である。FIG. 27 is a configuration diagram of a document search device according to a third embodiment.

【図２８】第４実施形態に係わるリンク集生成システム
の構成図である。FIG. 28 is a configuration diagram of a link collection generation system according to a fourth embodiment.

【図２９】第５実施形態に係わるリンク集生成システム
の構成図である。FIG. 29 is a configuration diagram of a link collection generation system according to a fifth embodiment.

【図３０】情報処理装置の構成図である。FIG. 30 is a configuration diagram of an information processing apparatus.

【図３１】記録媒体を示す図である。FIG. 31 is a diagram showing a recording medium.

【図３２】自明な読み入力インターフェースの一例を示
す図である。FIG. 32 is a diagram illustrating an example of a trivial reading input interface.

[Explanation of symbols]

11, 102: 処理装置 (processing device)
12, 123: 入力装置 (input device)
13: 表示装置 (display device)
21: リンク重要度付与器 (link importance assigning unit)
22: キーワード抽出器 (keyword extractor)
23: キーワード文書関連度計算器 (keyword document relevance calculator)
24: 索引生成器 (index generator)
25: 索引アクセス部 (index access unit)
26: アクセス解析器 (access analyzer)
27: ＵＲＬ類似度計算器 (URL similarity calculator)
30: 文書データ (document data)
31: リンク重要度 (link importance)
32: ページキーワード (page keyword)
33: 索引データ (index data)
34: アクセスログ (access log)
41: 文書情報テーブル (document information table)
42: 被参照文書テーブル (referenced document table)
51: キーワードテーブル (keyword table)
52: キーワード対応テーブル (keyword correspondence table)
53: 出現文書テーブル (appearance document table)
61: 索引情報テーブル (index information table)
62: 関連文書テーブル (related document table)
63: 関連キーワードテーブル (related keyword table)
71: アクセスログ (access log)
81, 101: 収集ロボット (collection robot)
82: 同義語データ (synonym data)
83, 107: ＷＷＷブラウザ (WWW browser)
91, 111: 文書タイプ判別器 (document type discriminator)
92, 112: 文書タイプ (document type)
103: ＵＲＬ文字列判別器 (URL character string discriminator)
104: 特定ＵＲＬ (specific URL)
105: リンク集 (link collection)
106: ＷＷＷサーバ (WWW server)
121: ＣＰＵ (CPU)
122: メモリ (memory)
124: 出力装置 (output device)
125: 外部記憶装置 (external storage device)
126: 媒体駆動装置 (medium drive device)
127: バス (bus)
128: ネットワーク接続装置 (network connection device)
129: 可搬記録媒体 (portable recording medium)
130: プログラムデータ (program data)
p, p₁, p₂, p₃, q, r₁, r₂, s₁, s₂: ページ (page)
S: ステップ (step)

フロントページの続き (Continued from the front page)
(51) Int.Cl.⁷  識別記号 (identification symbol)  FI  テーマコード (参考) (theme code, reference)
G06F 17/30  350  G06F 17/30 350C
12/00  546  12/00 546B

Claims

[Claims]

1. A document search device for retrieving document data from a document data group having link relationships, comprising: link importance assigning means for weighting the link relationships and assigning to the document data a link importance, which is a degree of importance; and access means for accessing the document data based on the link importance.

2. The document search device according to claim 1, wherein the link importance assigning means comprises URL similarity calculating means for calculating a URL similarity, which is the similarity between URLs (Uniform Resource Locators) pointing to the document data, and calculates the link importance using the similarity between the URLs and the link relationships between the document data.

3. The document search apparatus according to claim 1, further comprising a keyword extracting unit for extracting a keyword from the document data.

4. The document search device according to claim 3, wherein the keyword extracting means calculates the appearance frequency of the keyword in the document data, the device further comprising keyword document relevance calculating means for calculating a relevance between the keyword and the document data using the link importance and the appearance frequency of the keyword.

5. The document search device according to claim 4, further comprising monitoring means for monitoring access from users and creating an access log, wherein the keyword document relevance calculating means calculates the relevance using the keyword appearance frequency, the link importance, and the access log.

6. The document search device according to claim 4, further comprising document type discriminating means for determining the document type of the document data using the URL similarity and the numbers of outgoing and incoming links of the document data, wherein the document data is selected based on the document type, and the relevance is calculated for the selected document data.

7. The document search device according to claim 3, further comprising: an index generation unit configured to generate an index capable of accessing the keyword from reading or spelling of the extracted keyword.

8. The document search device according to claim 7, further comprising selection means for selecting a fragment of the reading or spelling of the keyword from the index, wherein the index generation means places in the index up to a predetermined number of pieces of document data ranked highest in the relevance calculated by the keyword document relevance calculating means, and the access means accesses the document data based on the selected fragment of the reading or spelling of the keyword.

9. The document search device according to claim 1, further comprising a collection unit that collects the document data from a network.

10. The document search apparatus according to claim 1, wherein the link importance assigning unit reduces the importance of document data that is linked from a URL having a low similarity.

11. The document search apparatus according to claim 1, wherein the link importance assigning unit regards a page linked from important document data having a low URL similarity as important.

12. The document search device according to claim 2, wherein the similarity of the URLs is determined based on the surface character-string information of the URLs, including the server address, the path, and the file name.

13. The document search device according to claim 1, wherein, where the target document data set is DOC = {p_1, p_2, ..., p_N}, the importance of document data p is W_p, the set of pages linked to from document data p is Ref(p), the set of pages linking to document data p is Refed(p), the URL similarity between document data p and q is sim(p, q), the difference is diff(p, q) = 1/sim(p, q), and a link is made from document data p to q, the weight lw(p, q) of the link is defined as follows: (equation not reproduced in this text); and the importance of each page is obtained, with C_q as a constant (the lower limit of the importance), as the solution of the simultaneous linear equations: (equation not reproduced in this text).

14. The document search device according to claim 1, wherein the document data is a WWW (World Wide Web) page.

15. A document index generation device for generating an index of a document data group having link relationships, comprising: link importance assigning means for assigning a link importance to the document data using the link relationships; keyword extraction means for extracting keywords from the document data; index generation means for generating an index from which a keyword can be accessed via the reading or spelling of the extracted keyword; and access means for accessing, when the reading or spelling of a keyword is selected from the index, the keyword and document data with high link importance related to the keyword.

16. The document index generation device according to claim 15, wherein the link importance assigning means comprises URL similarity calculating means for calculating a URL similarity, which is the similarity between URLs (Uniform Resource Locators) pointing to the document data, and calculates the link importance using the similarity between the URLs and the link relationships between the document data.

17. A document index generation device for generating an index of a document data group having a link relationship, comprising: link importance assigning means for assigning a link importance to the document data based on whether or not the URLs of the document data are similar; keyword extracting means for extracting a keyword from the document data; and index generating means for generating, based on the link importance, an index from which the keyword can be accessed by the reading or spelling of the extracted keyword.

18. A link collection generation system for generating a link collection to a document data group having a link relationship, comprising: collection means for collecting the document data from a network; link importance assigning means for assigning to the document data a link importance, which is an importance calculated using the link relationship; URL character string discriminating means for discriminating, from the document data group, a URL having a specific character string feature; and index generation means for generating a link collection in which links to the document data are arranged in order, within a certain number, based on the link importance and the specific character string feature of the URL.
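The final two elements of claim 18 can be sketched as a filter followed by a ranked cut-off. The claim does not say which character string feature is used, so the predicate is passed in as a parameter; the function name `build_link_collection` and the example feature (URLs ending in "/") are assumptions for illustration only.

```python
def build_link_collection(importance, has_feature, limit):
    """Return at most `limit` URLs that have the feature, most important first."""
    candidates = [url for url in importance if has_feature(url)]
    candidates.sort(key=lambda url: importance[url], reverse=True)
    return candidates[:limit]


# Hypothetical feature: treat URLs that end in "/" as top pages.
def ends_with_slash(url):
    return url.endswith("/")
```

A call such as `build_link_collection(scores, ends_with_slash, 10)` would then list up to ten top-page URLs in descending order of link importance, where `scores` is a mapping from URL to link importance.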

19. The link collection generation system according to claim 18, further comprising document type determination means for determining a document type of the document data using the URL similarity and the number of links to and the number of links from the document data, wherein the document data is selected based on the document type, and the link collection is generated for the selected document data based on the link importance and the specific character string feature of the URL.

20. A document retrieval method for retrieving document data from a document data group having a link relationship, comprising: a step of assigning to the document data a link importance, which is an importance calculated using the link relationship; and a step of accessing the document data based on the link importance.

21. The document search method according to claim 20, further comprising: a step of calculating a URL similarity, which is a similarity between URLs (Uniform Resource Locators) indicating the document data; and a step of calculating the link importance using the URL similarity and a link relationship between documents.

22. The document search method according to claim 20, further comprising: extracting a keyword from the document data.

23. The document search method according to any one of claims 20 to 22, further comprising: a step of calculating an appearance frequency of the keyword in the document data; and a step of calculating a degree of relevance between the keyword and the document data using the link importance and the appearance frequency of the keyword.
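Claim 23 combines two signals without saying how. The sketch below assumes the simplest combination, a product of the keyword's appearance count and the document's link importance. The tokenization by `\w+`, the function names, and the product rule are all assumptions of this example.

```python
import re
from collections import Counter


def keyword_frequency(text, keyword):
    """Count case-insensitive whole-word occurrences of the keyword."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words)[keyword.lower()]


def relevance(text, keyword, link_importance):
    """Hypothetical relevance: appearance frequency scaled by link importance."""
    return link_importance * keyword_frequency(text, keyword)
```

With this rule a document that never mentions the keyword has zero relevance regardless of its link importance, and among documents with equal keyword counts the one with higher link importance ranks first.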

24. The document search method according to claim 23, further comprising: a step of monitoring access from users to create an access log; and a step of calculating the degree of relevance using the appearance frequency of the keyword, the link importance, and the access log.

25. The document search method according to claim 23, further comprising: a step of determining a document type of the document data using the URL similarity and the number of links to and the number of links from the document data; a step of selecting the document data based on the document type; and a step of calculating the degree of relevance of the selected document data.

26. The document search method according to claim 22, further comprising: a step of generating an index from which the keyword can be accessed by the reading or spelling of the extracted keyword.

27. The document search method according to claim 26, further comprising: a step in which a user selects a fragment of the reading or spelling of the keyword from the index; a step of inserting into the index a certain number or fewer of document data items having a high degree of relevance; and a step of accessing the document data based on the fragment of the reading or spelling of the keyword.

28. The method according to claim 20, further comprising a step of collecting the document data from a network.

29. An automatic link collection generation method for automatically generating a link collection to a document data group having a link relationship, comprising: a step of collecting the document data from a network; a step of assigning a link importance to the document data; a step of discriminating, from the document data group, a URL having a specific character string feature; and a step of generating a link collection in which links to the document data are arranged in order, within a certain number, based on the link importance and the specific character string feature of the URL.

30. The link collection generation method according to claim 29, further comprising: a step of determining a document type of the document data using the URL similarity and the number of links to and the number of links from the document data; a step of selecting the document data based on the document type; and a step of generating the link collection for the selected document data based on the link importance and the specific character string feature of the URL.

31. A computer-readable recording medium recording a program for a computer that generates a link collection to a document data group having a link relationship, the program causing the computer to execute: a step of collecting the document data from a network; a step of assigning a link importance to the document data; a step of discriminating, from the document data group, a URL having a specific character string feature; and a step of generating a link collection in which links to the document data are arranged in order, within a certain number, based on the link importance and the specific character string feature of the URL.

32. A program for causing a computer to perform control for searching for document data from a document data group having a link relationship, the program causing the computer to execute processing including: a step of weighting the link relationship and assigning to the document data a link importance, which is a degree of importance; and a step of accessing the document data based on the link importance.

33. A program for causing a computer to control generation of a link collection to a document data group having a link relationship, the program causing the computer to execute: a step of collecting the document data from a network; a step of assigning a link importance to the document data using the link relationship; a step of discriminating, from the document data group, a URL having a specific character string feature; and a step of generating a link collection in which links to the document data are arranged in order, within a certain number, based on the link importance and the specific character string feature of the URL.