JP2010140373A - Method and device for detecting document group

Info

Publication number: JP2010140373A
Application number: JP2008317790A
Authority: JP
Inventors: Satoko Shiga; 聡子志賀; Tomoya Iwakura; 友哉岩倉; Aoshi Okamoto; 青史岡本
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2008-12-15
Filing date: 2008-12-15
Publication date: 2010-06-24
Anticipated expiration: 2028-12-15
Also published as: JP5396845B2

Abstract

PROBLEM TO BE SOLVED: To provide a document group detection method and a document group detection device for detecting a targeted document group provided on a network.
SOLUTION: On the basis of a set keyword, a document collecting means 12 detects documents belonging to a document group composed of a specific document and a plurality of documents connected to it, and identifies the specific document. The document collecting means 12 then uses the specific document to collect the documents of the group. A feature accumulating means 13 analyzes the collected documents and accumulates feature items representing features of the document group. A document group determining means 14 analyzes the accumulation result of the feature items on the basis of a feature rule, and determines whether or not the document group is a document group to be detected. If it is, the document group is registered as a document group candidate and presented to a user.
COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a document group detection method and a document group detection apparatus, and more particularly to a method and an apparatus for detecting a predetermined document group, which is a set of documents provided on a network and managed by one or more computers.

In recent years, technical terms have multiplied day by day with rapid advances in technology, and printed encyclopedias and dictionaries struggle to keep up with them. On networks, meanwhile, there are document groups, each a set of documents that explain such technical terms and that is managed by one or more computers. The World Wide Web (hereinafter, WWW) provided on the Internet, currently the most widespread such service, hosts many document groups of this kind. A document on the WWW is called a Web page, and a document group, or the location on the Internet where the document group resides, is called a Web site. Hereinafter, a Web site that gathers documents explaining technical terms is called a dictionary site or a term explanation site, and its Web pages are called explanation pages. The explanation pages of dictionary sites are updated daily, so they can be used to look up explanations of the latest terms.

There are also systems that automatically attach, to a term appearing in the text of an arbitrary Web page, a link to the explanation page for that term. Such a system is called an autolink system. In an autolink system, an autolink dictionary that associates each word to be autolinked with the URL (Uniform Resource Locator) of the explanation page for that word is created in advance. When a word registered in the autolink dictionary is detected in the text of a target HTML (Hyper Text Markup Language) document, a link to the URL associated with that word is attached. The resulting linked HTML document is output and provided to the user.
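The autolink step described above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the dictionary contents, the `dictionary.example` URLs, and the function name `autolink` are all hypothetical, and for brevity the sketch links terms in plain text rather than parsing an existing HTML document.

```python
import html
import re

# Hypothetical autolink dictionary: word -> URL of its explanation page.
AUTOLINK_DICT = {
    "LAN": "https://dictionary.example/terms/lan.html",
    "network": "https://dictionary.example/terms/network.html",
}


def autolink(text, dictionary):
    """Return an HTML fragment in which registered words are wrapped in links."""
    if not dictionary:
        return html.escape(text)
    # Try longer words first so a long word wins over a shorter word it contains.
    words = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in words))
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the untouched text between two matches.
        parts.append(html.escape(text[last:match.start()]))
        url = html.escape(dictionary[match.group(0)], quote=True)
        parts.append(f'<a href="{url}">{html.escape(match.group(0))}</a>')
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

Matching here is exact and case-sensitive; a real system would also need word segmentation, which matters especially for Japanese text.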

In creating such an autolink dictionary, the words registered in a dictionary site and the URLs of their explanation pages can be used as the words and their link destination URLs.
However, it is not easy to detect dictionary sites among the many Web sites on the network. Conventional search engines, which are generally used to find Web sites, search in units of Web pages, so a Web site, being a set of pages, has had to be identified manually. Dictionary sites to be registered in an autolink dictionary have also been detected manually, and managing them, including registering the sites and periodically maintaining the registered information, has been costly.

To enable information retrieval in units of Web sites, a method has been proposed that uses the meta information of each page for classification into link types, searches for parent pages based on the classification to estimate the internal structure of a Web site, and outputs search results in units of Web sites (see, for example, Patent Document 1). Another proposed method calculates, for each site, a score representing the relevance of the search result from the URLs of the hit pages and scores reflecting word weights, and outputs the search results in order of score (see, for example, Patent Document 2).
JP 2003-186883 A
JP 2003-186901 A

However, conventional searches in units of document groups (for example, Web sites) on a network cannot identify whether a retrieved document group is the desired one.
In a conventional search in units of Web sites, the pages hit by the search are analyzed, the internal structure of the Web site that aggregates them is estimated, and a Web site suited to the search purpose is detected based on that internal structure. Here, however, being suited to the search purpose means only that the site matches a search request, such as a keyword, to a high degree; it is not determined whether the Web site itself suits the purpose.

For example, when an autolink system links a term in a text to an explanation of that term, the linked information should preferably come from a dictionary site. This is because the neutrality and generality of information posted on Web pages that do not belong to a dictionary site are often not guaranteed. Simply linking a term in a text to some Web page that explains it therefore cannot guarantee the neutrality and generality of the explanation. Such pages need to be excluded from link destinations as far as possible, and the search needs to be oriented toward dictionary sites.

For these reasons, dictionary sites have been detected manually in conventional autolink systems. However, finding appropriate dictionary sites among a huge number of Web sites is not an easy task. In addition, because the work is manual, dictionary management costs are high, and a provider of an autolink service cannot add dictionaries frequently.

Beyond autolinking, automating the maintenance of dictionaries that associate terms with the URLs of related pages is an important issue, and the pages so associated must be ones provided by appropriate Web sites.

The present invention has been made in view of these points, and an object thereof is to provide a document group detection method and a document group detection apparatus for detecting a target document group provided on a network.

To solve the above problems, there is provided a document group detection method for detecting a predetermined document group, which is a set of documents provided on a network and managed by one or more computers. In this method, a computer executes a collection procedure, a feature counting procedure, a document group determination procedure, and a procedure for outputting document groups. The collection procedure targets document groups having a hierarchical structure in which a plurality of subordinate documents exist under a specific document, and searches for any of the subordinate documents using a specific keyword. It then detects the specific document based on the subordinate document found, and collects the plurality of subordinate documents under the specific document. The feature counting procedure extracts, for the specific document and each of the subordinate documents of a collected document group, link information that is attached to an arbitrary character string in a subordinate document and indicates an association with a specific other document. It then counts the number of states in which the subordinate document and the associated link destination document are in a specific relationship.
The document group determination procedure reads a feature rule, which is a condition expressed using the specific relationship, from a feature rule storage unit that stores the feature rule. It then determines whether the number of states of the specific relationship for the document group satisfies the condition of the feature rule, and registers a document group that satisfies the condition as a target document group candidate. The output procedure outputs the document groups registered as target document group candidates.

According to this document group detection method, subordinate documents are searched for using a specific keyword, targeting document groups having a hierarchical structure in which a plurality of subordinate documents exist under a specific document. The specific document is detected based on a document found by the search, and the plurality of subordinate documents under it are collected. Link information attached to character strings in the collected subordinate documents is then extracted. Based on this link information, the number of states in which a subordinate document and its link destination document are in a specific relationship is counted. A feature rule, which is a condition expressed using the specific relationship, is read from the feature rule storage unit, and it is determined whether the number of states of the specific relationship satisfies the condition of the feature rule. A document group that satisfies the condition is registered as a target document group candidate and output.

To solve the above problems, there is also provided a document group detection apparatus in which a computer executes the above document group detection method.

According to the disclosed document group detection method and document group detection apparatus, document groups in which a plurality of subordinate documents exist under a specific document are searched for using a keyword. Whether each retrieved document group satisfies the feature rule characterizing the document groups to be detected is determined, and target document group candidates are decided accordingly. Thus, once a keyword is set, for example, target document groups that contain documents matching the keyword and that have the features defined by the feature rule are detected automatically. As a result, the user's work of detecting document groups can be greatly reduced.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. An outline of the invention is described first, followed by the specific details.
The document group to be detected is a set of documents provided on a network and managed by one or more computers. The document group has a hierarchical structure consisting of a specific document and a plurality of subordinate documents under it. The specific document contains character strings to which link information associating other documents of the same document group is attached; it is, for example, a table of contents or an index for browsing the subordinate documents. In the specific document, link information associating the corresponding subordinate document is attached to character strings such as the titles of subordinate documents listed in a table of contents, or words and phrases that characterize the subordinate documents. A subordinate document is a document associated with the specific document by the link information of the specific document. In a subordinate document as well, when a character string appearing in the document is associated with another document, link information associating the string with that document is attached to the string. For example, a dictionary document group has a hierarchical structure consisting of a specific document, in which an index of the terms to be explained is described, and term explanation documents that explain the terms. In the specific document, the link information of the corresponding term explanation document is attached to each character string representing a term explained in a term explanation document. When a term explained in another term explanation document appears in a term explanation document, the link information of the corresponding term explanation document is likewise attached to the character string representing that term.
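The hierarchical structure described above could be modeled as in the following sketch. The class names `Document` and `DocumentGroup`, their fields, and the helper `missing_subordinates` are hypothetical names chosen for illustration; the description does not prescribe any particular data layout.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    address: str
    # Link information: character string -> address of the link destination document.
    links: dict[str, str] = field(default_factory=dict)


@dataclass
class DocumentGroup:
    # The specific document, for example an index or a table of contents.
    specific: Document
    # Subordinate documents keyed by their address.
    subordinates: dict[str, Document] = field(default_factory=dict)

    def missing_subordinates(self):
        """Addresses linked from the specific document that are not collected yet."""
        return [
            address
            for address in self.specific.links.values()
            if address not in self.subordinates
        ]
```

For a dictionary document group, `specific.links` would map each indexed term to its term explanation document, and each subordinate `Document` would carry the links attached to terms appearing in its own text.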

In the present invention, the link relationships between documents are analyzed based on the link information. Document groups that have the features of the detection target are then detected and presented to the user.
FIG. 1 is a diagram showing an outline of the invention.

The document group detection apparatus 10 has storage units, namely a document storage unit 11a, a feature rule storage unit 11b, an aggregate information storage unit 11c, a document group candidate storage unit 11d, and a document group storage unit 11e, and processing units, namely a document collection unit 12, a feature counting unit 13, a document group determination unit 14, and a document group presentation unit 15.

The document storage unit 11a stores the document data collected by the document collection unit 12. The feature rule storage unit 11b stores feature items representing the features of the document groups to be detected, feature rules for determining whether a retrieved document group is a target document group, and the like. The aggregate information storage unit 11c stores the per-feature-item counts that the feature counting unit 13 obtains by analyzing the collected documents. The document group candidate storage unit 11d stores information on the target document group candidates, that is, the document groups that the document group determination unit 14 has determined, based on the feature rules, to be candidates for the detection target. The document group storage unit 11e stores document group information for each document group that the user has designated as a target document group from among the candidates presented by the document group presentation unit 15; this information includes the identification information of the document group and the character strings with attached link information that are described in the specific document of the group.

The document collection unit 12 searches the network based on a set keyword and detects predetermined document groups that contain documents matching the keyword. The keyword is a word or phrase specified by the user, either an arbitrary one that characterizes the content of the document groups to be detected or one given as an example; that is, a word or phrase for obtaining the desired information is set. For example, to obtain network-related information, “network”, “LAN (Local Area Network)”, and the like are set. Alternatively, a network address from which a specific document can be acquired may be specified. When an arbitrary word or phrase is set as the keyword, the addresses of documents related to the keyword are obtained through a search engine. At this time, a preset word or phrase may be appended to the keyword and a further search performed to widen the search. The address of the specific document is then detected based on the address of each retrieved document, which carries its identification information on the network. In general, addresses indicating the locations of documents on a network have a hierarchical structure that mirrors the structure of the document group, so the address of the specific document in the upper layer can be predicted from the address of a retrieved subordinate document. Other techniques for obtaining the address of an upper-layer document are well known, and any of them may be used here. The specific document is acquired based on the address obtained in this way. When the address of the specific document is specified directly, processing starts from the acquisition of the specific document. In the specific document, link information associating the corresponding subordinate document is attached to each character string for which a subordinate document provides related information. The document collection unit 12 therefore extracts the link information attached to the character strings of the specific document and collects the document data of the subordinate documents based on it. The collected document data of the specific document and the subordinate documents is stored in the document storage unit 11a for each document group.
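One simple way to predict upper-layer addresses from a retrieved subordinate document's address is to strip path segments one at a time, as in the sketch below. The description leaves the concrete technique open, so this is only one possible approach; the function name `candidate_top_addresses` and the example URL are hypothetical.

```python
from urllib.parse import urlsplit


def candidate_top_addresses(page_url):
    """List candidate upper-layer addresses, nearest directory first, site root last.

    Returns an empty list when page_url is already the site root.
    """
    parts = urlsplit(page_url)
    segments = [s for s in parts.path.split("/") if s]
    root = f"{parts.scheme}://{parts.netloc}/"
    candidates = []
    # Drop the last path segment repeatedly: /a/b/c.html -> /a/b/ -> /a/ -> /
    for depth in range(len(segments) - 1, -1, -1):
        path = "/".join(segments[:depth])
        candidates.append(root + (path + "/" if path else ""))
    return candidates
```

Each candidate would then have to be fetched and checked, for example for whether it links back to the retrieved page, before being accepted as the specific document; that check is outside this sketch.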

For each document group, the feature counting unit 13 extracts and analyzes the link information attached to character strings in the documents stored in the document storage unit 11a, including the specific document, and counts the number of states in which the source document, in which the character string is described, and the link destination document are in a specific relationship. This is called counting the feature items. That is, the unit analyzes whether the relationship between the source document and the link destination document associated with it by the link information satisfies a feature item (a specific relationship) that characterizes the document groups to be detected, and counts, for each feature item, the number of states that satisfy it. This counting is performed for each document group, and the results are stored in the aggregate information storage unit 11c for each document group.
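Counting of feature items can be sketched as below. The two feature items used here (`same_domain` and `anchor_in_target_address`) are assumptions made up for illustration, since this passage does not enumerate the actual feature items; the input format, a mapping from a source address to its (character string, link destination) pairs, is likewise hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def same_domain(source, anchor, target):
    """Hypothetical feature item: source and link destination share a host."""
    return urlsplit(source).netloc == urlsplit(target).netloc


def anchor_in_target_address(source, anchor, target):
    """Hypothetical feature item: the linked string appears in the destination path."""
    return anchor.lower() in urlsplit(target).path.lower()


FEATURE_ITEMS = {
    "same_domain": same_domain,
    "anchor_in_target_address": anchor_in_target_address,
}


def count_features(links_by_document, feature_items=FEATURE_ITEMS):
    """Count, per feature item, the links of one document group that satisfy it."""
    counts = Counter({name: 0 for name in feature_items})
    counts["total_links"] = 0
    for source, links in links_by_document.items():
        for anchor, target in links:
            counts["total_links"] += 1
            for name, predicate in feature_items.items():
                if predicate(source, anchor, target):
                    counts[name] += 1
    return counts
```

Keeping the feature items as named predicates means new relationships can be added without changing the counting loop.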

The document group determination unit 14 determines, based on the per-document-group feature item counts produced by the feature counting unit 13, whether each document group satisfies the conditions of the document groups to be detected. The feature items used for the determination and the determination conditions, such as thresholds, are stored in advance in the feature rule storage unit 11b. A document group that satisfies the conditions is selected as a target document group candidate, and its identification information is registered in a target document group candidate table. The evaluation result based on the feature rules may also be quantified as a feature score, in which case the method of calculating the feature score is also defined in the feature rules. The target document group candidate table is stored in the document group candidate storage unit 11d. The calculated feature score, the link information of the specific document, and the like may be stored in the document group candidate storage unit 11d together with the identification information of the document group. The determination combines the counts of any chosen feature items. Combining a plurality of feature items in the determination raises the likelihood that a target document group candidate is in fact a document group to be detected.
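A determination that combines several feature items with thresholds, and optionally yields a feature score, might look like the following sketch. The rule shape (a minimum ratio of satisfying links per feature item) and the score (a weighted mean of those ratios) are assumptions for illustration only; the description says merely that thresholds and the score calculation are defined in the feature rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRule:
    feature: str         # name of the feature item the rule refers to
    min_ratio: float     # threshold on (count for the feature) / (total links)
    weight: float = 1.0  # weight of this feature in the feature score


def evaluate(counts, rules):
    """Return (is_candidate, feature_score) for one document group."""
    total = counts.get("total_links", 0)
    if total == 0 or not rules:
        return False, 0.0
    satisfied = True
    weighted_sum = 0.0
    weight_sum = 0.0
    for rule in rules:
        ratio = counts.get(rule.feature, 0) / total
        # Every rule must hold for the group to become a candidate.
        satisfied = satisfied and ratio >= rule.min_ratio
        weighted_sum += rule.weight * ratio
        weight_sum += rule.weight
    score = weighted_sum / weight_sum if weight_sum else 0.0
    return satisfied, score
```

Requiring all rules to hold reflects the point that combining feature items makes a candidate more likely to be a true target; a looser policy, such as thresholding the score alone, would be an equally valid reading.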

The document group presentation unit 15 presents to the user the identification information of the document groups registered as target document group candidates by the document group determination unit 14. It then registers, as target document groups, the candidates that the user selects. The identification information of each selected target document group is stored in the document group storage unit 11e. The link information of the specific document and the like may be stored in the document group storage unit 11e together with the identification information of the document group.

The operation of the document group detection apparatus 10 configured in this way, and the document group detection method it executes, will now be described.
When a keyword is input, the document collection unit 12 searches for documents provided on the network that contain the keyword. It then detects the address of the specific document based on the address of each retrieved document and acquires the specific document. When the address of a specific document is given as the keyword, the search is skipped and the specific document is acquired directly. Link information whose link destinations are the subordinate documents is attached to the character strings described in the specific document, so acquiring the specific document also yields the link information to the subordinate documents. The document collection unit 12 collects the subordinate documents based on this link information and stores the collected document data in the document storage unit 11a. This series of processes is performed for each document retrieved with the keyword. As a result, for each document group corresponding to a retrieved document, the document storage unit 11a holds the data of a plurality of documents belonging to the group, including the specific document.

Next, for each document group, the feature counting unit 13 analyzes the link information attached to the document data stored in the document storage unit 11a and counts the number of cases in which the relationship between the source document and the link destination document satisfies a feature item. Link information is also attached to a character string described in a subordinate document when another document is related to that string. The feature counting unit 13 extracts the link information attached to each document in this way and analyzes whether the relationship between the document and the link destination document specified by the link information satisfies a feature item that characterizes the document groups to be detected. For each feature item, it counts the pieces of link information that satisfy the item. The counts are stored in the aggregate information storage unit 11c as aggregate information.
The document group determination unit 14 then reads the feature rules from the feature rule storage unit 11b and the aggregate information from the aggregate information storage unit 11c, and determines whether the document group is a target document group candidate. The feature rules define the criteria for deciding, from the count for each feature item, whether the documents can be judged to have the features of a target document group. The counts produced by the feature counting unit 13 for each feature item are checked against the feature rules to make the determination. At this time, the likelihood that the document group is a target document group may be quantified as a feature score based on the feature rules. When the conditions defined by the feature rules are satisfied, the document group is treated as a target document group candidate and its identification information is stored in the document group candidate storage unit 11d. If necessary, the calculated feature score, the link information of the specific document, and the like are also stored in the document group candidate storage unit 11d. When the conditions are not satisfied, the document group is not treated as a target document group candidate.

When the user requests presentation of the document group candidates, the document group presentation unit 15 reads the identification information of the candidates stored in the document group candidate storage unit 11d and presents it to the user, for example by displaying it on a display device. The feature score, the link information of the specific document, and the like may be presented at the same time. When the user judges that a presented candidate is a target document group, the user designates it as such. On receiving the designation, the document group presentation unit 15 treats the designated candidate as a target document group and registers its identification information in the document group storage unit 11e. The link information of the specific document may be stored in the document group storage unit 11e together with the identification information of the document group.

Through the above processing, when the user sets a keyword as an example of the desired information, document groups that contain a document including the keyword and that have the features defined in advance by the feature rule are detected automatically, and a list of the detected document groups is presented. Because the target document groups are detected automatically, the work of finding document groups is greatly reduced. Management work such as periodic maintenance also becomes easier. Furthermore, the specific document contains predetermined terms (character strings) and connection information indicating the location of the document related to each character string, so these character strings and connection information can be used as they are when creating a dictionary. This has the further advantage that a dictionary can be created easily.

In the following, the invention is described in detail with reference to the drawings, taking as an example its application to a dictionary site detection system that detects a dictionary site as one example of a document group provided on the Internet. The detected dictionary sites are used as candidates for the dictionary applied to an autolink system or the like. In the embodiment, a document obtained by a viewer through a search is a Web page (hereinafter, a page), and a document group is a Web site (hereinafter, a site), which is a set of pages. A site consists of a top page, corresponding to a table of contents or an index, and other pages linked from the top page. A site is managed by one or more computers, and the identifier of that group of computers on the Internet is the domain. A site can therefore be identified by the domain common to the URLs of its pages. Most pages are written in HTML, in which a character string in a page can be linked to another page as anchor text.

FIG. 2 is a diagram illustrating a configuration example of the dictionary site detection system.
In the dictionary site detection system, a dictionary site detection server 100 that detects dictionary sites and a client device 200 of a user who issues site detection instructions are connected via a network 300.

The dictionary site detection server 100 is the document group detection device and, in response to a request from the client device 200, detects candidates for dictionary sites that provide documents explaining predetermined terms on the network. The client device 200 is, for example, the device of a creator who builds an autolink dictionary, and has a browser 210 and an input unit 220. The browser 210 displays HTML-format detection results and the like obtained from the dictionary site detection server 100 on a display device (not shown). The input unit 220 accepts the creator's instructions and sends them to the dictionary site detection server 100. The network 300 is, for example, the Internet.

The configuration of the dictionary site detection server 100 is as follows. The dictionary site detection server 100 has storage devices, namely an extended search rule (storage device) 111, an acquisition site (storage device) 112, a link feature database (hereinafter, DB) 113, a link feature rule (storage device) 114, a dictionary site determination rule (storage device) 115, a dictionary addition rule (storage device) 116, a dictionary candidate DB 117, and a dictionary DB 118, and processing units, namely a site acquisition unit 120, a link information extraction unit 130, a link feature totaling unit 140, a dictionary site determination unit 150, a dictionary entry candidate creation unit 160, and a user presentation unit 170.

The extended search rule (storage device) 111 stores extended search rules, which define how a keyword entered for a search is to be expanded. For example, character strings that often appear on term explanation pages, such as 「とは」 ("what is"), 「用語」 ("term"), and 「解説」 ("explanation"), are defined, together with conditions of use where necessary. Appending such an expansion string to the entered keyword narrows the search to results that are more likely to be explanation pages.

The acquisition site (storage device) 112 is the document storage unit 11a and stores the URLs of the sites acquired by the site acquisition unit 120, the page data of the collected pages, and the like.
The link feature DB 113 is the tabulation information storage unit 11c and stores, for each acquired site, the totals for each feature item representing the features of the site and the features related to its link information.

The link feature rule (storage device) 114 stores link feature rules, which define the rules for extracting feature items that represent the features of a dictionary site.
The dictionary site determination rule (storage device) 115 stores dictionary site determination rules, which set out the rules for determining, based on the features that dictionary sites have, whether a site is a dictionary site. A dictionary site determination rule defines the conditions for deciding, from the totals of the feature items, whether a site is a dictionary site. The link feature rule (storage device) 114 and the dictionary site determination rule (storage device) 115 correspond to the feature rule storage unit 11b.

The dictionary addition rule (storage device) 116 stores dictionary addition rules, which define the rules for adding detected dictionary site candidates to the dictionary.
The dictionary candidate DB 117 is the document group candidate storage unit 11d and stores a dictionary candidate table holding information on the sites determined to be dictionary candidates by the dictionary entry candidate creation unit 160.

The dictionary DB 118 is the document group storage unit 11e and stores a dictionary site table holding information on the sites registered as dictionary sites by the user.
The site acquisition unit 120 is the document collection unit 12 and collects the documents of sites that contain a document found by the keyword search. When a keyword is input, the unit reads the extended search rules stored in the extended search rule (storage device) 111 and appends an expansion string to the input keyword according to those rules. It then searches for pages using the expanded keyword. From each page found, it extracts the domain name and detects the top page (specific document). The top page is a table of contents, an index, or the like, and the character strings representing its items carry link information (connection information) pointing to the related pages. The unit therefore collects the subordinate pages based on the link information of the top page. The collected page data, which is document data written in HTML, is stored per site in the acquisition site (storage device) 112. When the URL of a top page is specified directly as the keyword, the processing up to top page detection is skipped and only the subsequent processing is performed.

The link information extraction unit 130 and the link feature totaling unit 140 together form the feature totaling unit 13. The link information extraction unit 130 analyzes the page data acquired by the site acquisition unit 120 and extracts, as link information, the anchor text, the link destination URL attached to the anchor text, and the title of the link destination page. Link information is extracted from every page acquired by the site acquisition unit 120, the top page and the other pages alike. The extracted information is stored in the link feature DB 113. The link feature totaling unit 140 analyzes the link information extracted by the link information extraction unit 130 based on the link feature rules stored in the link feature rule (storage device) 114. The link feature rules define feature items corresponding to the features of dictionary sites; the unit analyzes the link information and counts the number of links that satisfy each feature item. For example, if a link feature rule states that the proportion of links whose destination lies within the same site is characteristic, the unit counts the number of links to the same site and the total number of links in order to analyze that feature. The resulting totals are stored in the link feature DB 113.
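The same-site link count used in the example above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `count_same_site_links` is hypothetical, and comparing the host part of each URL with the site's domain is an assumption about how "same site" is decided.

```python
from urllib.parse import urlparse


def count_same_site_links(site_domain, link_urls):
    """Return (same_site_count, total_count) for a list of absolute link URLs.

    Assumption: a link is "within the same site" when the host part of its
    URL equals the site's domain.
    """
    total = len(link_urls)
    same_site = sum(1 for url in link_urls if urlparse(url).netloc == site_domain)
    return same_site, total


links = [
    "http://dicdic.com/vpn.html",
    "http://dicdic.com/lan.html",
    "http://abc.com/x.html",
]
same, total = count_same_site_links("dicdic.com", links)
```

The two counts are kept separate, rather than reduced to a ratio immediately, so that the determination rule can apply its own threshold later.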

The dictionary site determination unit 150 and the dictionary entry candidate creation unit 160 together form the document group determination unit 14. The dictionary site determination unit 150 reads the dictionary site determination rules stored in the dictionary site determination rule (storage device) 115. Based on these rules and the feature item totals stored in the link feature DB 113, it determines for each site whether the site is a dictionary site. For a site determined to be a dictionary site, information such as its URL is registered in the dictionary candidate table as a dictionary candidate. The dictionary candidate table is stored in the dictionary candidate DB 117. Next, the dictionary entry candidate creation unit 160 creates dictionary entry candidates for the sites registered as dictionary candidates. In a dictionary site, the relationship between an anchor text string and the URL of its link destination page is the same as the relationship between a term registered in the autolink dictionary and the URL of the explanation page for that term. For each site determined to be a dictionary site, the unit therefore extracts each anchor text string and the URL of its link destination page as a dictionary entry candidate and generates dictionary entry candidate information. In doing so, it registers the dictionary entry candidates with reference to the dictionary addition rule information stored in the dictionary addition rule (storage device) 116. For example, if an exclusion keyword is set in the dictionary addition rule information, an entry corresponding to that keyword is not registered. The generated dictionary entry candidate information is stored in the dictionary candidate DB 117 in association with the corresponding dictionary site.
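As a rough sketch of the entry candidate step, the following pairs each anchor text with its link destination URL and skips excluded keywords. The function name `build_entry_candidates`, the exact-match treatment of exclusion keywords, and the choice to keep the first URL seen for a repeated term are all assumptions made for illustration; the text does not specify them.

```python
def build_entry_candidates(links, excluded_keywords=()):
    """Map each anchor text (term) to the URL of its link destination page.

    links: iterable of (anchor_text, absolute_url) pairs.
    Assumptions: an exclusion keyword removes only anchor texts equal to it,
    and the first URL seen for a term is kept.
    """
    excluded = set(excluded_keywords)
    entries = {}
    for anchor_text, url in links:
        term = anchor_text.strip()
        if not term or term in excluded:
            continue  # skip empty anchors and excluded keywords
        entries.setdefault(term, url)
    return entries


sample_links = [
    ("VPN", "http://dicdic.com/vpn.html"),
    ("トップページ", "http://dicdic.com/"),
    ("LAN", "http://dicdic.com/lan.html"),
    ("VPN", "http://dicdic.com/other.html"),
]
candidates = build_entry_candidates(sample_links, excluded_keywords=["トップページ"])
```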

The user presentation unit 170 is the document group presentation unit 15 and presents information on the detected dictionary candidate sites to the user. It registers the site designated by the user as a dictionary site, sets the information on that dictionary site in the dictionary site table, and stores it in the dictionary DB 118.

The hardware configuration of the dictionary site detection server is described next. FIG. 3 is a block diagram illustrating a hardware configuration example of the dictionary site detection server.
The dictionary site detection server 100 as a whole is controlled by a CPU (Central Processing Unit) 101. A RAM (Random Access Memory) 102, a hard disk drive (HDD) 103, and a communication interface 104 are connected to the CPU 101 via a bus 105.

The RAM 102 temporarily stores at least part of the OS (Operating System) program and the application programs to be executed by the CPU 101. The RAM 102 also stores various data needed for processing by the CPU 101. The HDD 103 stores the OS and application programs. The communication interface 104 is connected to the network 300 and exchanges data with the client device 200 via the network 300.

With this hardware configuration, the processing functions of the dictionary site detection server 100 can be realized. Instructions to the dictionary site detection server 100 are entered through the input unit 220 of the client device 200 and sent over the network 300. For detection results and the like, the dictionary site detection server 100 generates display information and sends it to the client device 200, which shows it on its display device.

Next, the features of the dictionary sites to be detected are described. FIG. 4 is a diagram for explaining the features of a dictionary site.
A typical dictionary site has a hierarchical structure consisting of a top page 500, which corresponds to an index or table of contents, and explanation pages 510, 520, and 530, which explain the terms listed on the top page 500. The top page 500 provides the user with a list 501 of the terms for which the dictionary site offers explanation pages 510, 520, and 530. Each term on the top page 500 is linked to its explanation page 510, 520, or 530. For example, "VPN (Virtual Private Network)" at the top of the term list 501 is linked to the explanation page 510 (file name vpn.html), which explains the term "VPN". Similarly, "LAN" is linked to the explanation page 520 (file name LAN.html), which explains the term "LAN", and "RSS (Rich Site Summary)" is linked to the explanation page 530 (file name RSS.html), which explains the term "RSS".

The explanation page 510 for "VPN" has the title 511, 「VPNとはIT用語解説:DICDIC」 ("What is VPN? IT Glossary: DICDIC"), and text explaining "VPN". When a term explained on another explanation page appears in the explanation text, that term is given a link to its explanation page. For example, "LAN" appearing in the explanation text is linked to the explanation page 520 for "LAN".

The other explanation pages are similar. The explanation page 520 for "LAN" has the title 521, 「LANとはIT用語解説:DICDIC」 ("What is LAN? IT Glossary: DICDIC"), and explanation text. "Network" 522 in the explanation text is linked to an explanation page for "network" (not shown). The explanation page 530 for "RSS" has the title 531, 「RSSとはIT用語解説:DICDIC」 ("What is RSS? IT Glossary: DICDIC"), and explanation text. "Web site" 532 in the explanation text is linked to an explanation page for "Web site" (not shown).

From the above, a dictionary site can be said to have a high proportion of links that stay within the site and, in addition, the following features: (feature 1) a high proportion of links in which the character string specified at the link source (the anchor text) matches the character string in the title tag of the link destination page (the title); (feature 2) a high proportion of links in which the anchor text matches the file name of the link destination; and (feature 3) a high proportion of pages whose title, apart from the term, matches the titles of other pages in the site. Accordingly, when a dictionary site is viewed as a whole (in the example of FIG. 4, the set consisting of the top page 500 and the explanation pages 510, 520, and 530 linked beneath it), a high proportion of its pages exhibit at least one of features 1 to 3. Whether a site is a dictionary site can therefore be determined by analyzing the link structure of its pages and checking whether the number of links pointing within the site exceeds a certain proportion of the total number of links and whether the proportion of links satisfying at least one of features 1, 2, and 3 is at or above a certain value.
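The two-part check described above can be written as a small predicate. This is only a sketch under stated assumptions: the function name `is_dictionary_site` is hypothetical, the threshold values of 0.5 are placeholders (the text says only "a certain proportion" and "a certain value"), and using the total number of links as the denominator of the feature ratio is an assumption.

```python
def is_dictionary_site(total_links, internal_links, feature_links,
                       internal_threshold=0.5, feature_threshold=0.5):
    """Decide whether a site looks like a dictionary site from its link totals.

    total_links:    number of links found on the site's pages
    internal_links: number of those links that point within the site
    feature_links:  number of links satisfying at least one of features 1 to 3
    The thresholds are illustrative placeholders, not values from the source.
    """
    if total_links == 0:
        return False  # nothing to judge
    internal_ratio = internal_links / total_links
    feature_ratio = feature_links / total_links
    # The in-site ratio must exceed its threshold; the feature ratio must
    # be at or above its threshold.
    return internal_ratio > internal_threshold and feature_ratio >= feature_threshold
```

For instance, `is_dictionary_site(100, 90, 80)` passes both conditions, while `is_dictionary_site(100, 40, 80)` fails the in-site condition.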

The method of detecting each feature is explained with specific examples.
Feature 1 is that the anchor text matches the title in the title tag of the link destination page. In HTML, a link is set with the <a> tag: the anchor text is enclosed between <a href="..."> and </a>, where "..." is the link destination. For example, with VPN as the anchor text, <a href="http://...//vpn.html">VPN</a> indicates that the character string "VPN" is linked to "http://...//vpn.html". The title (the character string marked by <title>) of "http://...//vpn.html" is compared with the anchor text "VPN". If the title contains "VPN", the link is judged to satisfy feature 1.
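The feature 1 check amounts to a containment test between the anchor text and the title of the link destination page. A minimal sketch, with the hypothetical name `satisfies_feature1` and the added assumption that an empty anchor text never matches:

```python
def satisfies_feature1(anchor_text, target_title):
    """Feature 1: the link destination page's title contains the anchor text."""
    anchor = anchor_text.strip()
    # Assumption: an empty anchor text is never treated as a match.
    return bool(anchor) and anchor in target_title
```

With the titles from the FIG. 4 example, the anchor text "VPN" matches the title of the VPN explanation page but "LAN" does not.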

Feature 2 is that the anchor text matches the file name of the link destination. As noted above, the file name of the link destination can be extracted from <a href="...">. In the example given for feature 1, the file name "vpn.html" is extracted from "http://...//vpn.html" and compared with the anchor text "VPN". If they match, the link is judged to satisfy feature 2.
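A sketch of the feature 2 comparison follows. The text does not say how "vpn.html" and "VPN" are normalized before comparison, so dropping the file extension and ignoring case are assumptions here, and the name `satisfies_feature2` is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def satisfies_feature2(anchor_text, target_url):
    """Feature 2: the anchor text matches the link destination's file name.

    Assumptions: the extension is dropped ("vpn.html" -> "vpn") and the
    comparison ignores case.
    """
    stem = PurePosixPath(urlparse(target_url).path).stem
    return bool(stem) and stem.casefold() == anchor_text.strip().casefold()
```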

Feature 3 is that the title, apart from the term, matches the titles of other pages in the site. The titles of other pages can be extracted in the same way as for feature 2 above. For example, suppose that the title 「VPNとはIT用語解説:DICDIC」 ("What is VPN? IT Glossary: DICDIC") is extracted via <a href="...">VPN</a> and the title 「RSSとはIT用語解説:DICDIC」 ("What is RSS? IT Glossary: DICDIC") via <a href="...">RSS</a>. The character strings that remain after removing the term from each extracted title are then compared. In this example, 「とはIT用語解説:DICDIC」, obtained by removing "VPN", is compared with 「とはIT用語解説:DICDIC」, obtained by removing "RSS". If they match, the link is judged to satisfy feature 3.
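The feature 3 comparison can be sketched by stripping the term from each title and comparing what remains. The helper names `title_remainder` and `satisfies_feature3` are hypothetical, and removing only the first occurrence of the term is an assumption.

```python
def title_remainder(title, term):
    """Return the title with the first occurrence of the term removed."""
    return title.replace(term, "", 1).strip()


def satisfies_feature3(title_a, term_a, title_b, term_b):
    """Feature 3: two titles are identical once their terms are removed."""
    rest_a = title_remainder(title_a, term_a)
    rest_b = title_remainder(title_b, term_b)
    # Assumption: an empty remainder carries no evidence and does not count.
    return bool(rest_a) and rest_a == rest_b
```

For the two titles in the example, both remainders are 「とはIT用語解説:DICDIC」, so the pair satisfies feature 3.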

Note that although the dictionary site in the example of FIG. 4 has two levels, the top page 500 and the explanation pages 510, 520, and 530, the present invention is not limited to this. For example, even in a hierarchical structure in which field-specific index pages are placed between the top page 500 and the explanation pages 510, 520, and 530, the site has the same features as a dictionary site.

The operation of the dictionary site detection system, which detects dictionary sites using these features, and the procedure of the dictionary site detection process are described below using specific examples.
First, a keyword set by the user is input. The site acquisition unit 120 searches for pages containing the keyword and detects the top page. Then, based on the link information in the top page and in the pages linked from it, it acquires all pages reachable by links from the top page. Thus, if the user sets a word or phrase related to the field for which a dictionary is to be created, dictionary sites in that field can be detected. For example, setting "VPN" detects dictionary sites for IT terms, and setting "subprime loan" detects dictionary sites for financial terms. The specific processing is described next.

FIG. 5 is a diagram showing the flow of processing from keyword input to acquisition of the site's page information.
A keyword 600 is input by the user. In the example of FIG. 5, the keyword 600 is "VPN". When the keyword 600 is input, the site acquisition unit 120 expands the query based on the extended search rules stored in the extended search rule (storage device) 111 and generates extended keywords 610. For example, when 「とは」 ("what is") and 「用語」 ("term") are set as expansion words, two extended keywords 610 are generated from the keyword "VPN": 「VPNとは」 ("what is VPN") 612 and 「VPN用語」 ("VPN term") 613. This narrows the search to results that are more likely to be explanation pages. A search is then performed with the extended keywords 610, retrieving pages that contain the keyword for each of 「VPNとは」 612 and 「VPN用語」 613. FIG. 5 shows that, for 「VPNとは」 612, a search page group 620 consisting of pages 621, 622, and 623 containing 「VPNとは」 is found. The domain of page 621 is http://dicdic.com, that of page 622 is http://abc.com, and that of page 623 is http://a.co.jp. A search page group is obtained in the same way for the other extended keyword 613, but its description is omitted.
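Keyword expansion as described here is plain string concatenation. A minimal sketch, in which the function name `expand_keyword` is hypothetical and the default expansion strings are the two from the example:

```python
def expand_keyword(keyword, expansion_words=("とは", "用語")):
    """Append each expansion word to the keyword, one extended keyword per word."""
    return [keyword + word for word in expansion_words]


extended = expand_keyword("VPN")
```

Any conditions of use attached to an expansion rule are left out of this sketch, since the text does not describe their form.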

Next, the top page of each page in the search page group 620 is detected and its document data is acquired. One way to find the top page is to take the page whose URL is the domain of the page in the search page group 620 as the top page. Alternatively, each page in the search page group 620 is searched for anchor text containing 「トップページ」 ("top page"), and the link destination of that anchor text is taken to be the top page. FIG. 5 shows the top page 630 of page 621; here, the top page 500 shown in FIG. 4 is assumed to be detected. When the keyword 600 directly specifies the URL of a top page, processing starts from acquisition of the top page's document data. Top pages are obtained in the same way for the other pages 622 and 623, but their description is omitted here.
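The two ways of locating a top page can be sketched as follows. The function names are hypothetical, and representing "the page whose URL is the domain" as scheme plus host with a trailing slash is an assumption.

```python
from urllib.parse import urlparse


def top_page_from_domain(page_url):
    """Method 1: treat the root URL of the page's domain as the top page."""
    parts = urlparse(page_url)
    return f"{parts.scheme}://{parts.netloc}/"


def top_page_from_anchor(links, marker="トップページ"):
    """Method 2: follow the first anchor whose text contains the marker string.

    links: iterable of (anchor_text, absolute_url) pairs from the page.
    Returns None when no such anchor exists.
    """
    for anchor_text, url in links:
        if marker in anchor_text:
            return url
    return None
```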

Next, the subordinate explanation page group 640 is collected. As explained with FIG. 4, the top page 500 contains anchor text whose link destinations are the subordinate explanation pages. All anchor texts in the top page 500 and their link destination information are therefore extracted, and a crawling process is performed to acquire the explanation pages. This yields the explanation page group 640 corresponding to the anchor texts listed on the top page 500. In the example of FIG. 5, the explanation page group 640 acquired includes vpn.html of explanation page 510, lan.html of explanation page 520, and rss.html of explanation page 530.

Furthermore, when the explanation pages 510, 520, and 530 themselves contain link information to other explanation pages, those linked explanation pages are acquired as well. As a result, the explanation page group 640 contains every page that can be reached by following links from the top page.
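Collecting every page reachable from the top page is a graph traversal. The sketch below uses breadth-first search and takes a caller-supplied `get_links` function in place of real page fetching, so it runs without network access. Restricting the crawl to the top page's host and capping it with `max_pages` are assumptions added for safety; the text itself says only that all pages reachable by links are acquired.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_site(top_url, get_links, max_pages=1000):
    """Return the URLs reachable from top_url, in breadth-first order.

    get_links(url) must return the absolute link destination URLs found on
    that page. Assumptions: only URLs on the same host as top_url are
    followed, and at most max_pages pages are visited.
    """
    host = urlparse(top_url).netloc
    seen = {top_url}
    queue = deque([top_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for target in get_links(url):
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# A tiny in-memory site standing in for fetched pages.
site = {
    "http://dicdic.com/": ["http://dicdic.com/vpn.html", "http://dicdic.com/lan.html"],
    "http://dicdic.com/vpn.html": ["http://dicdic.com/lan.html", "http://other.example/x.html"],
    "http://dicdic.com/lan.html": ["http://dicdic.com/network.html"],
}
pages = crawl_site("http://dicdic.com/", lambda url: site.get(url, []))
```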

The above processing is performed on the search page group 620 found for each extended keyword 610, and for each, the top page 630 and its subordinate explanation page group 640 are collected.

Using the top page 630 of a site found in this way and its subordinate explanation page group 640, it is determined whether the site has the features of a dictionary site.
First, for each site (common domain), the link information extraction unit 130 reads the HTML document files of the top page 630 and the explanation page group 640 and extracts the link information in the pages. That is, it analyzes the HTML document of each page read and extracts the domain of the site, the URL of the page being analyzed, the anchor text, its link destination URL, and the title of the link destination page. These are registered in the URL-title table and the anchor text-link destination URL table. Since the domain is common to the top page and its subordinate explanation pages, it does not need to be extracted every time.

FIG. 6 is a diagram showing an example of the URL-title table. The URL-title table is stored in the link feature DB 113.
The URL-title table 1131 holds the information items domain 1131a, URL 1131b, and title 1131c.

The domain 1131a holds the domain that serves as the identifier of the site and is contained in common in the URLs of the top page and its subordinate explanation pages. The link information extraction unit 130 extracts the domain from the top page or any explanation page and registers it in the domain 1131a.

The URL 1131b holds the URL of the link destination page of an anchor text extracted by analyzing the HTML.
The title 1131c holds the title extracted from the link destination page stored in the URL 1131b.

As described with reference to FIGS. 4 and 5, for example, "VPN" on the top page 500 is an anchor text linked to the VPN explanation page 510, and is written in HTML as <a href="http://dicdic.com/vpn.html">VPN</a>. From this, "http://dicdic.com/vpn.html" is extracted as the link destination and registered in the URL 1131b. If the extracted URL is written as a relative path, it is converted into an absolute path before being registered in the URL 1131b. Further, vpn.html of the link destination explanation page 510 is analyzed, and its title (the character string indicated by <title>) is extracted. In the example of FIG. 6, "VPN IT Glossary: DICDIC" is extracted and registered in the title 1131c. In the same way, the URLs of the other link destination pages written in the top page 500 and the titles of those pages are extracted and registered in the URL-title table 1131. After the processing of the top page 500 is completed, the same processing is performed for each page of the explanation page group 640. If an identical entry has already been registered in the URL-title table 1131, it is not registered again.

In this way, the link information extraction unit 130 extracts, for each site, the link destination URLs appearing in the pages and the titles of the link destination pages, and registers them in the URL-title table 1131.
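The extraction described above can be illustrated with a minimal Python sketch. It assumes the standard-library html.parser and urllib.parse modules; the class name LinkAndTitleParser, the function parse_page, and the sample HTML are hypothetical and are not part of the described apparatus.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects (anchor text, absolute link URL) pairs and the <title> string."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []        # list of (anchor_text, absolute_url)
        self.title = ""
        self._href = None      # href of the <a> element currently open, if any
        self._text = []        # text fragments seen inside the open <a>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # A relative path is converted to an absolute URL before registration.
                self._href = urljoin(self.page_url, href)
                self._text = []
        elif tag == "title":
            self._in_title = True

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "title":
            self._in_title = False


def parse_page(page_url, html):
    """Return (title, links) for one page; a hypothetical helper for illustration."""
    parser = LinkAndTitleParser(page_url)
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


# Hypothetical top page containing one relative link.
top_html = (
    "<html><head><title>DICDIC</title></head>"
    '<body><a href="vpn.html">VPN</a></body></html>'
)
title, links = parse_page("http://dicdic.com/index.html", top_html)
```

In this sketch the title of each link destination page would be obtained by calling parse_page again on the fetched destination page; fetching itself is left out.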

FIG. 7 is a diagram showing an example of the anchor text-link destination URL table. The anchor text-link destination URL table is stored in the link feature DB 113.
The anchor text-link destination URL table 1132 holds the information items domain 1132a, processing target URL 1132b, anchor text 1132c, and link destination URL 1132d.

The domain 1132a is the same as the domain 1131a in FIG. 6.
The processing target URL 1132b holds the URL of the page whose HTML was analyzed.

The anchor text 1132c holds an anchor text extracted from the page registered in the processing target URL 1132b.
The link destination URL 1132d holds the URL of the link destination page corresponding to the anchor text 1132c.

As in FIG. 6, for example, when the top page 500 is analyzed, the URL of the top page 500 (here, http://dicdic.com/index.html) is registered in the processing target URL 1132b. The anchor text "VPN", which links to the "VPN" explanation page 510, is registered in the anchor text 1132c. Also as in FIG. 6, the link destination "http://dicdic.com/vpn.html" is extracted and registered in the link destination URL 1132d.

After the processing of the top page 500 is completed, the same processing is performed for each page of the explanation page group 640. If an identical entry has already been registered in the anchor text-link destination URL table 1132, it is not registered again.

In this way, the link information extraction unit 130 extracts, for each site, the URL of the processing target page from which an anchor text was extracted, the anchor text, and the link destination URL, and registers them in the anchor text-link destination URL table 1132.
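The registration into the anchor text-link destination URL table, including the rule that an identical entry is not registered twice, might be sketched as follows. The row layout, the function name register_links, and the reading of "identical" as equality of all four items are assumptions made for illustration.

```python
from urllib.parse import urlparse


def register_links(table, seen, page_url, links):
    """Append one row per (anchor text, link URL) pair, skipping rows already registered."""
    domain = urlparse(page_url).netloc
    for anchor_text, link_url in links:
        key = (domain, page_url, anchor_text, link_url)
        if key in seen:
            continue  # the same item is already in the table, so it is not registered again
        seen.add(key)
        table.append(
            {
                "domain": domain,            # corresponds to domain 1132a
                "page_url": page_url,        # corresponds to processing target URL 1132b
                "anchor_text": anchor_text,  # corresponds to anchor text 1132c
                "link_url": link_url,        # corresponds to link destination URL 1132d
            }
        )


anchor_table, seen_rows = [], set()
top_links = [("VPN", "http://dicdic.com/vpn.html")]
register_links(anchor_table, seen_rows, "http://dicdic.com/index.html", top_links)
# Registering the same link a second time leaves the table unchanged.
register_links(anchor_table, seen_rows, "http://dicdic.com/index.html", top_links)
```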

Next, the link feature totaling unit 140 totals the feature items based on the URL-title table 1131 and the anchor text-link destination URL table 1132 set by the link information extraction unit 130. As described above, if a site is a dictionary site, a high proportion of its links are closed within the site, and the proportion of links satisfying one or more of feature 1, feature 2, and feature 3 is at or above a certain value. To detect these characteristics, the link feature totaling unit 140 totals the following feature items. First, to detect the proportion of links closed within the site, it totals the number of all links and the number of links closed within the site (hereinafter, internal links). As the feature item of feature 1, it totals the number of links for which the anchor text at the link source matches the character string in the title tag (the title) of the link destination page. As the feature item of feature 2, it totals the number of links for which the anchor text matches the file name of the link destination. As the feature item of feature 3, it detects titles whose character string other than the term matches the titles of other pages in the site, and totals the number of times such a title appears.

FIG. 8 is a diagram showing an example of the totaled information for feature 1. (A) is the feature 1 counter table and (B) is the feature 1 entry table. Both are stored in the link feature DB 113. The totaled information for feature 1 is obtained by analyzing the URL-title table 1131 and the anchor text-link destination URL table 1132 and counting the links that match feature 1.

(A) The feature 1 counter table 1133 has the information items processing target URL (domain) 1133a, all link counter 1133b, internal link counter 1133c, and feature 1 counter 1133d.

The processing target URL (domain) 1133a holds the URL of the site being processed; the URL read from the domain 1132a of the anchor text-link destination URL table 1132 is registered here. The corresponding counters are values totaled for this site.

The all link counter 1133b holds the total count of link information detected for the site. The counted link information includes both links whose destination is a page inside the site and links whose destination is a page outside the site. Specifically, it counts all link information registered in the link destination URL 1132d of the records of the anchor text-link destination URL table 1132 whose domain 1132a matches the domain of the target site.

The internal link counter 1133c holds, out of the link information detected for the site, the count of link information whose link destination is a page inside the site. It counts the link information for which the domain 1132a matches the domain of the target site and the domain part of the link destination URL 1132d matches the domain name in the domain 1132a.
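The two counters can be illustrated with a short sketch. The row layout (dictionaries with "domain" and "link_url" keys) and the function name count_links are assumptions; comparing the host part of the link destination URL against the site domain is one straightforward way to decide whether a link is internal.

```python
from urllib.parse import urlparse


def count_links(rows, site_domain):
    """Return (all link count, internal link count) for one site."""
    total = 0
    internal = 0
    for row in rows:
        if row["domain"] != site_domain:
            continue  # row belongs to another site
        total += 1  # every link found on the site's pages, internal or external
        if urlparse(row["link_url"]).netloc == site_domain:
            internal += 1  # link destination stays inside the site
    return total, internal


# Hypothetical rows of the anchor text-link destination URL table.
sample_rows = [
    {"domain": "dicdic.com", "link_url": "http://dicdic.com/vpn.html"},
    {"domain": "dicdic.com", "link_url": "http://dicdic.com/lan.html"},
    {"domain": "dicdic.com", "link_url": "http://example.org/"},
    {"domain": "other.example", "link_url": "http://other.example/a.html"},
]
total_links, internal_links = count_links(sample_rows, "dicdic.com")
```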

The feature 1 counter 1133d holds, for each site, the count of link information for which the anchor text matches the title of the page at the link destination URL. For an anchor text read from the anchor text 1132c of the anchor text-link destination URL table 1132, the corresponding link destination URL 1132d is extracted. A URL matching the extracted link destination URL is then looked up in the URL-title table 1131. The title of that URL is taken from the title 1131c corresponding to the matched URL and compared with the anchor text originally read from the anchor text 1132c. If they match, the feature 1 counter 1133d is incremented. If no URL matching the link destination URL 1132d is found in the URL 1131b, the URL is adjusted, for example by adding or removing "index.html" or by changing whether "#" or "?" is present, and the corresponding URL is searched for again.
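The feature 1 count with the URL adjustment step might look like the following sketch. The normalization used here (dropping the fragment, the query string, and a trailing "index.html") is only one possible interpretation of the adjustment described above, and exact string equality between anchor text and title is likewise an assumption. The names normalize_url and count_feature1 are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Drop '#...' and '?...' and a trailing index.html so near-identical URLs compare equal."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    if path == "":
        path = "/"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def count_feature1(anchor_rows, url_titles):
    """Count links whose anchor text equals the title of the link destination page."""
    normalized = {normalize_url(u): t for u, t in url_titles.items()}
    count = 0
    entries = []  # (anchor text, link URL) pairs, as in the feature 1 entry table
    for row in anchor_rows:
        link = row["link_url"]
        title = url_titles.get(link)  # exact match first
        if title is None:
            title = normalized.get(normalize_url(link))  # then the adjusted URL
        if title is not None and title == row["anchor_text"]:
            count += 1
            entries.append((row["anchor_text"], link))
    return count, entries


# Hypothetical table contents.
url_titles = {
    "http://dicdic.com/vpn.html": "VPN",
    "http://dicdic.com/index.html": "DICDIC",
}
anchor_rows = [
    {"anchor_text": "VPN", "link_url": "http://dicdic.com/vpn.html#top"},
    {"anchor_text": "Home", "link_url": "http://dicdic.com/"},
]
feature1_count, feature1_entries = count_feature1(anchor_rows, url_titles)
```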

(B) The feature 1 entry table 1134 has the information items processing target URL (domain) 1134a, word 1134b, and URL 1134c.
As in the feature 1 counter table 1133, the processing target URL (domain) 1134a holds the URL (domain) of the site being processed. The word 1134b holds an anchor text that satisfies the condition of feature 1. The URL 1134c holds the URL that satisfied the condition of feature 1, in association with the anchor text. When the link feature totaling unit 140 compares an anchor text with the title of the link destination URL and determines that they match, the anchor text is stored in the word 1134b and the link destination URL in the URL 1134c. That is, whenever the feature 1 counter 1133d of the feature 1 counter table 1133 is incremented by 1, the anchor text and link destination URL determined to satisfy feature 1 are registered.

FIG. 9 is a diagram showing an example of the totaled information for feature 2. (C) is the feature 2 counter table and (D) is the feature 2 entry table. Both are stored in the link feature DB 113.

(C) The feature 2 counter table 1135 has the information items processing target URL (domain) 1135a, all link counter 1135b, internal link counter 1135c, and feature 2 counter 1135d.

The processing target URL (domain) 1135a holds the URL of the site being processed. The all link counter 1135b holds the total count of link information detected for the site. The internal link counter 1135c holds, out of the link information detected for the site, the count of link information whose link destination is a page inside the site. These three items are the same as the items of the same name in the feature 1 counter table 1133, so a detailed description is omitted.

The feature 2 counter 1135d holds, for each site, the count of link information for which the anchor text matches the file name of the page at the link destination URL. The anchor text 1132c and the corresponding link destination URL 1132d are read from the anchor text-link destination URL table 1132. The anchor text is then URL-encoded, and the resulting character string is compared with the link destination file name contained in the link destination URL. If they match, the feature 2 counter 1135d is incremented.
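A minimal sketch of the feature 2 comparison follows. The text above does not say whether the file extension or letter case is considered, so this sketch assumes that the comparison ignores case and accepts a match against the file name either with or without its extension; the function name matches_feature2 is hypothetical.

```python
import posixpath
from urllib.parse import quote, urlsplit


def matches_feature2(anchor_text, link_url):
    """True if the URL-encoded anchor text equals the link destination file name."""
    file_name = posixpath.basename(urlsplit(link_url).path)  # e.g. "vpn.html"
    stem, _ext = posixpath.splitext(file_name)               # e.g. "vpn"
    encoded = quote(anchor_text, safe="")                    # URL-encode the anchor text
    # Assumption: compare case-insensitively, with or without the extension.
    return encoded.lower() in (file_name.lower(), stem.lower())


ascii_match = matches_feature2("VPN", "http://dicdic.com/vpn.html")
no_match = matches_feature2("VPN", "http://dicdic.com/lan.html")
# A non-ASCII term whose file name is its percent-encoded form.
japanese_url = "http://dicdic.com/" + quote("辞書", safe="") + ".html"
japanese_match = matches_feature2("辞書", japanese_url)
```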

(D) The feature 2 entry table 1136 has the information items processing target URL (domain) 1136a, word 1136b, and URL 1136c.
As in the feature 2 counter table 1135, the processing target URL (domain) 1136a holds the URL of the site being processed. The word 1136b holds an anchor text that satisfies the condition of feature 2. The URL 1136c holds the URL that satisfied the condition of feature 2, in association with the anchor text. The processing is the same as for feature 1 except that the condition to be satisfied is feature 2 rather than feature 1. That is, whenever the feature 2 counter 1135d of the feature 2 counter table 1135 is incremented by 1, the anchor text and link destination URL determined to satisfy feature 2 are registered.

FIG. 10 is a diagram showing an example of the totaled information for feature 3. (E) is the feature 3 counter table, (F) is the common title table, and (G) is the feature 3 entry table. All are stored in the link feature DB 113.

(E) The feature 3 counter table 1137 has the information items processing target URL (domain) 1137a, all link counter 1137b, internal link counter 1137c, feature 3 counter 1137d, and common title 1137e.

The processing target URL (domain) 1137a holds the URL of the site being processed. The all link counter 1137b holds the total count of link information detected for the site. The internal link counter 1137c holds, out of the link information detected for the site, the count of link information whose link destination is a page inside the site. These three items are the same as the items of the same name in the feature 1 counter table 1133, so a detailed description is omitted.

The feature 3 counter 1137d holds, for each site, the count of link information whose link destination pages share a common title once the anchor text is removed. That title is registered in the common title 1137e. The link feature totaling unit 140 analyzes the URL-title table 1131 and the anchor text-link destination URL table 1132 and creates the common title table. Of the common titles registered in the common title table 1138, the count for the one with the largest number of link information is registered in the feature 3 counter 1137d, and that common title is registered in the common title 1137e.

(F) The common title table 1138 has the information items processing target URL (domain) 1138a, common title 1138b, and counter 1138c.
The processing target URL (domain) 1138a holds the URL of the site being processed. The common title 1138b holds a common title extracted from the link destination URLs. The counter 1138c holds the count of link information in which that common title appeared.

A URL matching the link destination URL 1132d that corresponds to an anchor text 1132c of the anchor text-link destination URL table 1132 is looked up in the URL-title table 1131. The title 1131c corresponding to the matched URL 1131b is extracted, and the character string that remains after removing the originally read anchor text 1132c from it is obtained. If this character string is already registered in the common title 1138b of the common title table, the corresponding counter 1138c is incremented. If it is not registered in the common title 1138b, a new record is added to the common title table 1138 and the character string is registered.
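The construction of the common title table can be sketched with a counter keyed by the title remainder. The function name common_title_counts and the sample titles are hypothetical, and skipping links whose title does not contain the anchor text is an assumption of this sketch.

```python
from collections import Counter


def common_title_counts(anchor_rows, url_titles):
    """Count how often each title remainder (title minus anchor text) appears."""
    counts = Counter()
    for row in anchor_rows:
        title = url_titles.get(row["link_url"])
        anchor = row["anchor_text"]
        if title is None or not anchor or anchor not in title:
            continue  # assumption: only titles that contain the anchor text are considered
        remainder = title.replace(anchor, "", 1).strip()
        if remainder:
            counts[remainder] += 1  # new record on first sight, increment afterwards
    return counts


# Hypothetical table contents.
titles = {
    "http://dicdic.com/vpn.html": "VPN IT Glossary: DICDIC",
    "http://dicdic.com/lan.html": "LAN IT Glossary: DICDIC",
    "http://dicdic.com/about.html": "About DICDIC",
}
rows = [
    {"anchor_text": "VPN", "link_url": "http://dicdic.com/vpn.html"},
    {"anchor_text": "LAN", "link_url": "http://dicdic.com/lan.html"},
    {"anchor_text": "About", "link_url": "http://dicdic.com/about.html"},
]
title_counts = common_title_counts(rows, titles)
# The most frequent remainder and its count correspond to the common title
# and the feature 3 counter, respectively.
best_title, feature3_count = title_counts.most_common(1)[0]
```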

(G) The feature 3 entry table 1139 has the information items processing target URL (domain) 1139a, word 1139b, URL 1139c, and common title 1139d.
As in the feature 3 counter table 1137, the processing target URL (domain) 1139a holds the URL of the site being processed. The word 1139b holds an anchor text that satisfies the condition of feature 3. The URL 1139c holds the URL that satisfied the condition of feature 3, in association with the anchor text. The common title 1139d holds the corresponding title.

In this way, the link feature totaling unit 140 analyzes the URL-title table 1131 and the anchor text-link destination URL table 1132 in which the extracted link information is registered. Together with the all link counter and the internal link counter, it totals the feature 1 counter, the feature 2 counter, and the feature 3 counter, which are the numbers of links satisfying the characteristics of a dictionary site, and creates an entry candidate table for each requirement. For feature 1, the feature 1 counter table 1133 and the feature 1 entry table 1134 are generated. For feature 2, the feature 2 counter table 1135 and the feature 2 entry table 1136 are generated. For feature 3, the feature 3 counter table 1137 and the feature 3 entry table 1139 are generated. Totaling is performed only for the features designated in advance.

The dictionary site determination unit 150 uses the totaled feature 1 counter table 1133, feature 2 counter table 1135, and feature 3 counter table 1137 to determine, based on a dictionary site determination rule, whether the processing target URL (domain) is a dictionary site. The dictionary site determination rule is read from the dictionary site determination rule (storage device) 115 and applied.

For example, consider the case where the dictionary site determination rule is "of all links, links within the site account for 90% or more, and the proportion of links satisfying feature 1 is 90% or more". In this case, the proportion of links within the site and the proportion of links satisfying feature 1 are calculated with reference to the feature 1 counter table 1133.

The proportion of links within the site is the proportion of internal links among all links, so it can be obtained as (number of internal links) / (number of all links). For example, when the processing target URL is "http://dicdic.com/", the all link counter 1133b is "110" and the internal link counter 1133c is "101", so the proportion R of internal links is calculated as
R = 101/110 (≈ 0.918).
The proportion R1 of links satisfying feature 1 is calculated as
R1 = 100/101 (≈ 0.99).
Under the above determination rule, the value of B becomes the dictionary-likeness score.
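The arithmetic of this example can be checked with a few lines of Python. The counter values are the ones quoted above; the feature 1 count of 100 is taken from the numerator of R1, and the variable names are hypothetical.

```python
# Counter values from the worked example for the site dicdic.com.
all_links = 110        # all link counter 1133b
internal_links = 101   # internal link counter 1133c
feature1_links = 100   # feature 1 counter 1133d (numerator of R1)

r = internal_links / all_links        # proportion of internal links
r1 = feature1_links / internal_links  # proportion of links satisfying feature 1

# Rule: internal links are 90% or more AND feature 1 links are 90% or more.
rule_holds = r >= 0.9 and r1 >= 0.9
```

Both proportions exceed 0.9, so the example site satisfies the rule.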

Furthermore, features can be combined into a determination rule. For example, consider the rule "of all links, links within the site account for 90% or more, and the proportion of links satisfying feature 1 is 90% or more, or the proportion of links satisfying feature 2 is 90% or more and the proportion of links satisfying feature 3 is 90% or more".

In this case, as above, the values of the all link counter and the internal link counter for the relevant URL are taken from the feature 1 counter table 1133, the feature 2 counter table 1135, or the feature 3 counter table 1137, and the proportion of internal links (R) is calculated. The proportion of links satisfying feature 1 (R1) is calculated from the feature 1 counter 1133d and the internal link counter 1133c of the feature 1 counter table 1133. The proportion of links satisfying feature 2 (R2) is calculated from the feature 2 counter 1135d and the internal link counter 1135c of the feature 2 counter table 1135. The proportion of links satisfying feature 3 (R3) is calculated from the feature 3 counter 1137d and the internal link counter 1137c of the feature 3 counter table 1137. It is then determined whether the rule holds for the calculated R, R1, R2, and R3. If the rule holds, the site can be determined to be a dictionary site. A site determined to be a dictionary site is registered in the dictionary candidate table.
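A combined rule of this kind might be evaluated as in the sketch below. The wording of the example rule leaves the grouping of "and" and "or" open; this sketch assumes the reading R ≥ 0.9 and (R1 ≥ 0.9 or (R2 ≥ 0.9 and R3 ≥ 0.9)). The function names and the zero-denominator handling are assumptions as well.

```python
def ratio(numerator, denominator):
    """Proportion with a guard for sites that have no links at all."""
    return numerator / denominator if denominator else 0.0


def is_dictionary_site(all_links, internal, f1, f2, f3, threshold=0.9):
    """Evaluate one possible reading of the combined determination rule."""
    r = ratio(internal, all_links)  # proportion of internal links
    r1 = ratio(f1, internal)        # proportion satisfying feature 1
    r2 = ratio(f2, internal)        # proportion satisfying feature 2
    r3 = ratio(f3, internal)        # proportion satisfying feature 3
    return r >= threshold and (
        r1 >= threshold or (r2 >= threshold and r3 >= threshold)
    )


by_feature1 = is_dictionary_site(110, 101, 100, 0, 0)       # passes via feature 1
by_features23 = is_dictionary_site(110, 101, 10, 95, 96)    # passes via features 2 and 3
mostly_external = is_dictionary_site(110, 50, 50, 50, 50)   # too few internal links
```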

Alternatively, the proportions of links for the individual features may be weighted to calculate a score. With weighting coefficient α for feature 1, weighting coefficient β for feature 2, and weighting coefficient γ for feature 3, the score S can be calculated by
S = αR1 + βR2 + γR3 ... (1)
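Equation (1) translates directly into code. The weight values used in the example call are arbitrary illustrations, not values given in this description, and the function name dictionary_score is hypothetical.

```python
def dictionary_score(r1, r2, r3, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted dictionary-likeness score S = alpha*R1 + beta*R2 + gamma*R3."""
    return alpha * r1 + beta * r2 + gamma * r3


# Example with arbitrary weights that emphasize feature 1.
score = dictionary_score(0.99, 0.5, 0.0, alpha=0.5, beta=0.3, gamma=0.2)
```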

FIG. 11 is a diagram showing an example of the dictionary candidate table and its entry candidate table. (H) is an example of the dictionary candidate table and (I) is an example of the entry candidate table. Both are stored in the dictionary candidate DB 117.

(H) The dictionary candidate table 1171 has the information items site URL (domain) 1171a and score 1171b.
The site URL (domain) 1171a holds the URL of a site determined to be a dictionary candidate.

The score 1171b stores the dictionary-likeness score calculated by equation (1).
(I) The entry candidate table 1172 has the information items processing target URL (domain) 1172a, word 1172b, and URL 1172c.

The processing target URL (domain) 1172a holds the URL of the site determined to be a dictionary candidate. The word 1172b is set to a word registered in any of the feature 1 entry table 1134, the feature 2 entry table 1136, and the feature 3 entry table 1139 created for this site. The URL 1172c is likewise set to the corresponding URL from one of those tables.

When dictionary candidates and entry candidates are registered, processing follows the dictionary addition rules stored in the dictionary addition rule (storage device) 116. For example, if a URL determined to be a dictionary candidate is the same as a URL whose registration is prohibited by the dictionary addition rules, it is not registered as a dictionary candidate. Likewise, when entry candidates are registered and exclusion keywords or the like have been set, an anchor text that matches an exclusion keyword is not registered in the entry candidate table 1172.
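The effect of the dictionary addition rules can be sketched as a simple filter. Representing the prohibited URLs and exclusion keywords as Python sets, and the function name apply_addition_rules, are assumptions made for illustration; the actual format of the stored rules is not specified here.

```python
def apply_addition_rules(site_url, entries, prohibited_urls, excluded_keywords):
    """Return the entries to register, or None if the site itself must not be registered."""
    if site_url in prohibited_urls:
        return None  # registration of this site as a dictionary candidate is prohibited
    # Drop entries whose anchor text matches an exclusion keyword.
    return [(word, url) for word, url in entries if word not in excluded_keywords]


# Hypothetical rule contents and candidate entries.
prohibited = {"http://ng.example/"}
excluded = {"Home"}
candidates = [
    ("VPN", "http://dicdic.com/vpn.html"),
    ("Home", "http://dicdic.com/index.html"),
]
kept = apply_addition_rules("http://dicdic.com/", candidates, prohibited, excluded)
blocked = apply_addition_rules("http://ng.example/", candidates, prohibited, excluded)
```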

After the dictionary candidates are determined in this way, when there is a display request from the user, the user presenting unit 170 displays them on the display unit of the client device 200.
FIG. 12 is a diagram showing an example of the dictionary site candidate list screen.

The dictionary site candidate list screen 2100 is displayed on a display device connected to the client device 200. It displays a dictionary site candidate list table 2101 showing the dictionary site candidates, a detailed entry selection button 2102, an NG site registration button 2103, a dictionary site registration button 2104, and the like.

On the dictionary site candidate list screen 2100, a check column 2101a, an identification number 2101b, a URL 2101c, an entry example 2101d, and a score 2101e are displayed based on the dictionary candidate table 1171 and the entry candidate table 1172.

The check column 2101a is a column for selecting a site to be registered as a dictionary site or as an NG site. Checked sites become the target of processing. The identification number 2101b is a number assigned to the dictionary site candidates in order. The URL 2101c is the URL of the site and is displayed based on the site URL 1171a of the dictionary candidate table 1171. The entry example 2101d is an example of an entry that can be referenced in this dictionary site candidate; an arbitrary entry is selected from the URL 1172c for the corresponding site in the entry candidate table 1172, and its URL is displayed. The anchor text from the corresponding word 1172b may be selected and displayed here instead. The score 2101e shows the dictionary-likeness score of the site, taken from the score 1171b of the dictionary candidate table 1171.

A detailed entry selection button 2102 is provided for each site. For example, when the detailed entry selection button 2102 of dictionary site candidate No. 1 is operated, the detailed entry selection screen 2110 opens and the entry candidate list 2111 corresponding to dictionary site candidate No. 1 is displayed. The entry candidate list 2111 shows the records extracted from the entry candidate table 1172 for the selected dictionary site candidate. The check column 2111a is a column for selecting whether or not to register the entry. The identification number 2111b is a number assigned to each entry candidate in order. In the URL 2101c, the anchor text registered in the word 1172b of the entry candidate table 1172 for the selected dictionary site candidate is displayed with the link information of the URL 1172c attached.

The user displays the dictionary site candidate list screen 2100 and selects dictionary sites. If necessary, the user can display the detailed entry selection screen 2110 to check the contents of the entry candidates, and can also select there whether or not each is to be kept as an entry candidate. If there is a candidate to be registered as a dictionary site, the user checks the check column 2101a and operates the dictionary site registration button 2104 labeled "Register in dictionary site". As a result, the selected dictionary site candidate and its entry candidates are registered as a dictionary in the dictionary site table and stored in the dictionary DB 118. If the user does not want a site registered as a dictionary site, the user operates the NG site registration button 2103 labeled "Register in NG site". As a result, the site is deleted from the dictionary site candidates. At this time, the site may also be registered as an NG site in the dictionary addition rules and stored in the dictionary addition rule (storage device) 116 so that it will not become a dictionary site candidate in the future.

In this way, when the user sets an example word as a keyword, dictionary sites of that genre are detected automatically, and at the same time dictionary entries (pairs of a term and the URL of its explanation page) are extracted. The user only has to check these on the dictionary site candidate list screen 2100 and decide whether or not to use them as a dictionary, so a dictionary can be added easily. If the processing is executed periodically, dictionary maintenance also becomes easy.

以下、上記の辞書サイト検出システムにおける辞書サイト検出方法の処理手順について、フローチャートを用いて説明する。
図１３は、辞書サイト検出方法の全体の処理手順を示したフローチャートである。このフローチャートは、キーワードが入力されてから辞書サイト候補の提示までの処理手順を示している。ユーザが設定したキーワードが入力され、処理が開始される。 Hereinafter, a processing procedure of the dictionary site detection method in the dictionary site detection system will be described with reference to a flowchart.
FIG. 13 is a flowchart showing the entire processing procedure of the dictionary site detection method. This flowchart shows a processing procedure from input of a keyword to presentation of dictionary site candidates. The keyword set by the user is input and the process is started.

［ステップＳ０１］サイト取得部１２０が、入力されたキーワードに基づいて、このキーワードが含まれるページを有するサイトを検索する。検索されたサイトについて、トップページと、トップページにリンクされる他のページを取得するサイト取得処理を行う。取得されたページ群は、サイトごとに取得サイト（記憶装置）１１２に格納される。サイト取得処理の詳細は後述する。 [Step S01] Based on the input keyword, the site acquisition unit 120 searches for a site having a page including the keyword. A site acquisition process is performed for acquiring the top page and other pages linked to the top page for the searched site. The acquired page group is stored in the acquisition site (storage device) 112 for each site. Details of the site acquisition process will be described later.

［ステップＳ０２］リンク情報抽出部１３０が、ステップＳ０１のサイト取得処理によって取得サイト（記憶装置）１１２に格納されたサイトから１つを候補サイトに選択し、そのページ群を解析する。そして、ページからリンク情報を抽出し、ＵＲＬ−タイトルテーブル１１３１及びアンカーテキスト−リンク先ＵＲＬテーブル１１３２に登録する。リンク情報抽出処理の詳細は後述する。 [Step S02] The link information extraction unit 130 selects one candidate site from the sites stored in the acquisition site (storage device) 112 by the site acquisition process in step S01, and analyzes the page group. Then, link information is extracted from the page and registered in the URL-title table 1131 and the anchor text-link destination URL table 1132. Details of the link information extraction processing will be described later.

［ステップＳ０３］リンク特徴集計部１４０が、ＵＲＬ−タイトルテーブル１１３１及びアンカーテキスト−リンク先ＵＲＬテーブル１１３２を解析する。そして、リンク情報に、辞書サイト特有の特徴を示す特徴項目が検出された数を集計する。そして、特徴ごとに集計情報と、エントリテーブルとを生成する。リンク特徴集計処理の詳細は後述する。 [Step S03] The link feature totaling unit 140 analyzes the URL-title table 1131 and the anchor text-link destination URL table 1132. It then counts how many times feature items indicating features unique to dictionary sites are detected in the link information, and generates total information and an entry table for each feature. Details of the link feature totaling process will be described later.

［ステップＳ０４］辞書サイト判定部１５０が、辞書サイト判定ルール１１５（記憶装置）に記憶される辞書サイト判定ルールに基づいて、集計情報から辞書らしさスコアを算出する。そして辞書らしさスコアが所定値以上のサイトを辞書サイト候補と判定する。辞書サイト判定処理の詳細は後述する。 [Step S04] The dictionary site determination unit 150 calculates a dictionary-likeness score from the aggregate information based on the dictionary site determination rule stored in the dictionary site determination rule 115 (storage device). A site having a dictionary-likeness score of a predetermined value or more is determined as a dictionary site candidate. Details of the dictionary site determination process will be described later.

［ステップＳ０５］ステップＳ０４の辞書サイト判定処理により、このサイトが辞書サイトと判定されたかどうかをチェックする。辞書サイトと判定されたときは、処理をステップＳ０６に進める。辞書サイトと判定されなかったときは、処理をステップＳ０７に進める。 [Step S05] It is checked whether or not this site is determined to be a dictionary site by the dictionary site determination processing in step S04. If it is determined to be a dictionary site, the process proceeds to step S06. If it is not determined to be a dictionary site, the process proceeds to step S07.

［ステップＳ０６］辞書サイトと判定されたときは、途中抽出されたエントリテーブルに基づいて、辞書エントリ候補を作成する。辞書エントリ候補作成処理の詳細は後述する。 [Step S06] When the site is determined to be a dictionary site, dictionary entry candidates are created based on the entry tables extracted during the preceding processing. Details of the dictionary entry candidate creation process will be described later.

［ステップＳ０７］キーワードによって検出された全候補サイトの処理が終了したかどうかを判定する。全候補サイトの処理が終了したときは、処理をステップＳ０８に進める。全候補サイトの処理が終了していないときは、ステップＳ０２に戻って、次の候補サイトの処理を行う。 [Step S07] It is determined whether or not the processing of all candidate sites detected by the keyword is completed. When the processes for all candidate sites are completed, the process proceeds to step S08. If all candidate sites have not been processed, the process returns to step S02 to process the next candidate site.

［ステップＳ０８］全候補サイトの処理が終了したときは、辞書サイト候補に登録されたサイトと、そのエントリとをユーザに提示する処理を行う。ユーザ提示処理の詳細は後述する。 [Step S08] When processing of all candidate sites is completed, a process of presenting sites registered as dictionary site candidates and their entries to the user is performed. Details of the user presentation process will be described later.

以上の処理手順が実行されることにより、入力されたキーワードに基づいて所望のジャンルの辞書サイトの可能性があるサイトが検出され、そのサイト内にページ間のリンク情報に基づきそのサイトが辞書サイトとしての特徴を有しているかどうかが判定される。そして、辞書サイト候補と判定されたときは、そのサイトがユーザに提示される。また、このとき同時に、リンク情報からオートリンク辞書の作成に必要な辞書エントリ候補も生成される。これにより、ユーザは、所望のジャンルに関連するキーワードを設定するだけで、簡単に所望のジャンルの辞書サイトを検出することができる。また、辞書のエントリ（単語とその解説ページのＵＲＬとを対応付けた情報）も同時に得ることができるため、オートリンク辞書の作成が容易になる。 By executing the above processing procedure, sites that may be dictionary sites of the desired genre are detected based on the input keyword, and whether each site has the characteristics of a dictionary site is determined based on the link information between the pages within that site. When a site is determined to be a dictionary site candidate, it is presented to the user. At the same time, the dictionary entry candidates needed to create an autolink dictionary are generated from the link information. As a result, the user can easily detect dictionary sites of a desired genre simply by setting a keyword related to that genre. In addition, because dictionary entries (information associating a word with the URL of its explanation page) are obtained at the same time, creating an autolink dictionary becomes easy.
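The loop of steps S02 to S08 can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `detect_dictionary_sites`, the callables `score_site` and `extract_entries`, and the numeric `threshold` are hypothetical stand-ins for the link feature totaling unit, the dictionary site determination rule, and the entry candidate creation, none of which are specified at this level of detail in the text.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Entry = Tuple[str, str]  # (term, URL of its explanation page)


def detect_dictionary_sites(
    candidate_sites: Iterable[str],
    score_site: Callable[[str], float],            # hypothetical: stands in for S03-S04
    extract_entries: Callable[[str], List[Entry]],  # hypothetical: stands in for S06
    threshold: float,                               # hypothetical "predetermined value"
) -> Dict[str, List[Entry]]:
    """Score each candidate site and keep those at or above the threshold."""
    candidates: Dict[str, List[Entry]] = {}
    for site in candidate_sites:                    # S02 / S07: one candidate site at a time
        score = score_site(site)                    # S03-S04: dictionary-likeness score
        if score >= threshold:                      # S05: judged to be a dictionary site?
            candidates[site] = extract_entries(site)  # S06: entry candidates
    return candidates                               # S08: what would be shown to the user
```

A caller would pass the sites returned by the site acquisition process and present the returned mapping on the candidate list screen.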

以下、各処理の詳細を説明する。
図１４は、サイト取得処理の手順を示したフローチャートである。
キーワードが入力されて処理が開始される。 Details of each process will be described below.
FIG. 14 is a flowchart showing a procedure of site acquisition processing.
A keyword is input and processing starts.

［ステップＳ１１］拡張検索ルール（記憶装置）１１１に格納される拡張検索ルールを読み出し、キーワードを拡張検索ルールに基づき変化させ、クエリを拡張する。
［ステップＳ１２］ステップＳ１１で作成されたクエリを用いて検索を行う。これにより、キーワードを含むページが検出される。なお、検索では、複数のキーワードを受け付け、それぞれの検索結果のＡＮＤをとるなどしてもよい。 [Step S11] The extended search rule stored in the extended search rule (storage device) 111 is read, the keyword is changed based on the extended search rule, and the query is extended.
[Step S12] A search is performed using the query created in Step S11. Thereby, a page including the keyword is detected. In the search, a plurality of keywords may be accepted and AND of the respective search results may be taken.

［ステップＳ１３］ステップＳ１２で検索されたページの１つを選択し、そのトップページを検出する。検出したトップページは、取得サイト（記憶装置）１１２に格納する。 [Step S13] One of the pages searched in step S12 is selected, and the top page is detected. The detected top page is stored in the acquisition site (storage device) 112.

［ステップＳ１４］ステップＳ１３で検出されたトップページに記述されるリンク情報を抽出する。
［ステップＳ１５］ステップＳ１４で検出されたリンク情報に基づき、リンクされる配下の解説ページを取得する。取得した解説ページは、取得サイト（記憶装置）１１２に格納する。 [Step S14] Link information described in the top page detected in step S13 is extracted.
[Step S15] Based on the link information extracted in step S14, the linked subordinate explanation pages are acquired. The acquired explanation pages are stored in the acquisition site (storage device) 112.

［ステップＳ１６］ステップＳ１５で取得した解説ページ内に記述されるリンク情報を抽出する。
［ステップＳ１７］ステップＳ１４およびステップＳ１６で抽出したリンク情報に基づき、トップページからリンクでたどれるすべての解説ページが収集されたかどうかを判定する。収集されたときは、処理をステップＳ１８に進める。すべて収集し終わっていないときは、処理をステップＳ１５に戻し、次の解説ページを収集する。 [Step S16] Link information described in the explanation page acquired in step S15 is extracted.
[Step S17] Based on the link information extracted in step S14 and step S16, it is determined whether or not all the explanation pages reachable by links from the top page have been collected. If they have all been collected, the process proceeds to step S18. If not, the process returns to step S15 to collect the next explanation page.

［ステップＳ１８］ステップＳ１２で検索された全検索ページについて処理が終了したかどうかを判定する。終了していないときは、ステップＳ１３に戻って、検索結果の次のサイトを取得する処理を行う。終了したときは、サイト取得処理を終了する。 [Step S18] It is determined whether or not the processing has been completed for all search pages searched in step S12. If not completed, the process returns to step S13 to perform processing for acquiring the next site of the search result. When finished, the site acquisition process is finished.

以上の処理手順が実行されることにより、キーワードに基づいて検索されたページが属するサイトのトップページと、トップページにリンクされる配下のページが収集され、取得サイト（記憶装置）１１２に格納される。 By executing the above processing procedure, the top page of the site to which each page found by the keyword search belongs, and the subordinate pages linked from that top page, are collected and stored in the acquisition site (storage device) 112.
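The collection loop of steps S13 to S17 amounts to a traversal of the site's link graph starting at the top page. The sketch below uses a breadth-first traversal, which is one possible ordering and not something the text prescribes. The names `collect_site_pages` and `get_links` are hypothetical, and `get_links` is assumed to return only links that stay within the same site.

```python
from collections import deque
from typing import Callable, List


def collect_site_pages(top_url: str, get_links: Callable[[str], List[str]]) -> List[str]:
    """Collect the top page and every page reachable from it by links."""
    collected = [top_url]              # S13: the top page is stored first
    seen = {top_url}
    queue = deque([top_url])
    while queue:                       # S17: loop until no uncollected page remains
        page = queue.popleft()
        for link in get_links(page):   # S14 / S16: link information on the page
            if link not in seen:
                seen.add(link)
                collected.append(link)  # S15: acquire and store the linked page
                queue.append(link)
    return collected
```

The `seen` set prevents a page from being fetched twice when pages link back to each other, which a dictionary site's index and explanation pages commonly do.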

図１５は、リンク情報抽出処理の手順を示したフローチャートである。
サイト取得処理（ステップＳ０１）によって取得サイト（記憶装置）１１２に格納された１つのサイトを候補サイトとし、そのページ群（トップページと解説ページ群）を読み出し、処理を開始する。 FIG. 15 is a flowchart showing the procedure of link information extraction processing.
One site stored in the acquisition site (storage device) 112 by the site acquisition process (step S01) is taken as the candidate site, its page group (the top page and the explanation pages) is read, and the process starts.

［ステップＳ２１］候補サイトのドメイン名を抽出する。例えば、トップページのＵＲＬからドメイン名を抽出する。抽出されたドメイン名は、他の情報項目登録時に、ＵＲＬ−タイトルテーブル１１３１のドメイン１１３１ａ及びアンカーテキスト−リンク先ＵＲＬテーブル１１３２のドメイン１１３２ａに格納される。 [Step S21] The domain name of the candidate site is extracted. For example, the domain name is extracted from the URL of the top page. The extracted domain name is stored in the domain 1131a of the URL-title table 1131 and the domain 1132a of the anchor text-link destination URL table 1132 when other information items are registered.

［ステップＳ２２］候補サイトのページ群から未処理のページを１ページ取り出す。処理対象のページは、トップページ、解説ページのどちらも含む。
［ステップＳ２３］取り出したページのＨＴＭＬの解析を行う。また、取り出したページのＵＲＬをアンカーテキスト−リンク先ＵＲＬテーブル１１３２の処理対象ＵＲＬ１１３２ｂに格納する。 [Step S22] One unprocessed page is extracted from the page group of the candidate site. The pages to be processed include both the top page and the explanation page.
[Step S23] The HTML of the extracted page is analyzed. Further, the URL of the extracted page is stored in the processing target URL 1132 b of the anchor text-link destination URL table 1132.

［ステップＳ２４］ページ内のリンク情報Ａを抽出し、ＵＲＬ−タイトルテーブル１１３１に登録する。処理の詳細は、後述する。
［ステップＳ２５］ページ内のリンク情報Ｂを抽出し、アンカーテキスト−リンク先ＵＲＬテーブル１１３２に登録する。処理の詳細は、後述する。 [Step S24] The link information A in the page is extracted and registered in the URL-title table 1131. Details of the processing will be described later.
[Step S25] The link information B in the page is extracted and registered in the anchor text-link destination URL table 1132. Details of the processing will be described later.

［ステップＳ２６］候補サイトに属する全ページの処理が終了したかどうかを判定する。終了していないときは、ステップＳ２２に戻って、次のページの処理を行う。終了したときは、リンク情報抽出処理を終了する。 [Step S26] It is determined whether or not the processing of all pages belonging to the candidate site has been completed. If not completed, the process returns to step S22 to process the next page. When finished, the link information extraction process is finished.

図１６は、リンク情報Ａ抽出処理の手順を示したフローチャートである。
リンク情報Ａ（リンク先のページのＵＲＬとそのページのタイトル）を抽出し、ＵＲＬ−タイトルテーブル１１３１に登録する処理を行う。 FIG. 16 is a flowchart showing the procedure of the link information A extraction process.
The link information A (the URL of the linked page and the title of the page) is extracted and registered in the URL-title table 1131.

［ステップＳ２４１］処理対象のページから未処理のリンク情報を１つ抽出する。
［ステップＳ２４２］リンク先ＵＲＬの指定情報を抽出し、ＵＲＬが相対パスであれば、処理対象のページのファイル位置からリンク先のページのＵＲＬを絶対パスに変換する。リンク先ＵＲＬは、処理対象のページからの相対パスで記載されている場合があるため、このような相対パスを絶対パスに変換する。 [Step S241] One unprocessed link information is extracted from the page to be processed.
[Step S242] The designation information of the link destination URL is extracted, and if the URL is a relative path, the URL of the link destination page is converted into an absolute path from the file position of the processing target page. Since the link destination URL may be described as a relative path from the page to be processed, such a relative path is converted into an absolute path.

［ステップＳ２４３］ステップＳ２４２で抽出されたリンク先ＵＲＬのページを取得し、ＨＴＭＬを解析して、タイトルを抽出する。
［ステップＳ２４４］ステップＳ２４２で絶対パスに変換したリンク先のページのＵＲＬをＵＲＬ−タイトルテーブル１１３１のＵＲＬ１１３１ｂに登録する。また、ステップＳ２４３で抽出されたリンク先のページのタイトルを先に登録したＵＲＬに対応付けてタイトル１１３１ｃに登録する。前段で抽出されたドメイン名もドメイン１１３１ａに登録する。ＵＲＬ−タイトルテーブル１１３１は、リンク特徴ＤＢ１１３に格納する。 [Step S243] The page of the link destination URL extracted in Step S242 is acquired, the HTML is analyzed, and the title is extracted.
[Step S244] The URL of the link destination page converted into the absolute path in step S242 is registered in the URL 1131b of the URL-title table 1131. In addition, the title of the linked page extracted in step S243 is registered in the title 1131c in association with the previously registered URL. The domain name extracted in the previous stage is also registered in the domain 1131a. The URL-title table 1131 is stored in the link feature DB 113.

［ステップＳ２４５］処理対象のページの最後のリンク情報であるかどうかを判定する。最後のリンク情報でないときは、ステップＳ２４１に戻って次のリンク情報の処理を行う。最後のリンク情報であれば、リンク情報Ａ抽出処理を終了する。 [Step S245] It is determined whether it is the last link information of the page to be processed. If it is not the last link information, the process returns to step S241 to process the next link information. If it is the last link information, the link information A extraction process is terminated.

以上の処理手順が実行されることにより、処理対象のページのリンク情報に記述されるリンク先のＵＲＬと、リンク先のページのタイトルとが、ＵＲＬ−タイトルテーブル１１３１に登録される。 By executing the above processing procedure, the link destination URL described in the link information of the processing target page and the title of the link destination page are registered in the URL-title table 1131.
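Steps S242 to S244 can be illustrated with the Python standard library: `urllib.parse.urljoin` resolves a relative link against the URL of the page being processed, and `html.parser.HTMLParser` reads the title of the linked page. This is a sketch only. The names `TitleParser`, `extract_title`, `url_title_row`, and the `fetch` callable (which returns the HTML of a URL) are hypothetical, and the dictionary keys merely mirror the columns of the URL-title table 1131.

```python
from html.parser import HTMLParser
from typing import Callable, Dict
from urllib.parse import urljoin


class TitleParser(HTMLParser):
    """Accumulates the text found inside the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def url_title_row(domain: str, page_url: str, href: str,
                  fetch: Callable[[str], str]) -> Dict[str, str]:
    absolute = urljoin(page_url, href)       # S242: relative path -> absolute URL
    title = extract_title(fetch(absolute))   # S243: fetch the linked page, read its title
    return {"domain": domain, "url": absolute, "title": title}  # S244: one table row
```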

図１７は、リンク情報Ｂ抽出処理の手順を示したフローチャートである。
リンク情報Ｂ（処理対象のページのＵＲＬ、アンカーテキスト及びリンク先のページのＵＲＬ）を抽出し、アンカーテキスト−リンク先ＵＲＬテーブル１１３２に登録する処理を行う。 FIG. 17 is a flowchart showing the procedure of the link information B extraction process.
The link information B (the URL of the processing target page, the anchor text, and the URL of the link destination page) is extracted and registered in the anchor text-link destination URL table 1132.

［ステップＳ２５１］処理対象のページから未処理のリンク情報を１つ抽出する。
［ステップＳ２５２］リンク情報から、アンカーテキストと、リンク先ＵＲＬとを抽出する。 [Step S251] One piece of unprocessed link information is extracted from the page to be processed.
[Step S252] An anchor text and a link destination URL are extracted from the link information.

［ステップＳ２５３］ステップＳ２５２で抽出されたリンク先ＵＲＬが相対パスであるかどうかを判定する。相対パスであれば、処理をステップＳ２５４に進める。相対パスでなければ、処理をステップＳ２５５に進める。 [Step S253] It is determined whether the link destination URL extracted in step S252 is a relative path. If it is a relative path, the process proceeds to step S254. If it is not a relative path, the process proceeds to step S255.

［ステップＳ２５４］リンク先ＵＲＬが相対パスであったときは、処理対象のページのファイル位置からリンク先のページのＵＲＬを絶対パスに変換する。
［ステップＳ２５５］ステップＳ２５２で抽出されたアンカーテキストを、アンカーテキスト−リンク先ＵＲＬ１１３２のアンカーテキスト１１３２ｃに登録する。また、リンク先ＵＲＬ（絶対パス）も、アンカーテキストに対応付けて、リンク先ＵＲＬ１１３２ｄに登録する。前段で抽出されたドメイン名はドメイン１１３２ａ、処理対象のページのＵＲＬは処理対象ＵＲＬ１１３２ｂに登録する。アンカーテキスト−リンク先ＵＲＬ１１３２は、リンク特徴ＤＢ１１３に格納される。 [Step S254] When the link destination URL is a relative path, the URL of the link destination page is converted into an absolute path from the file position of the page to be processed.
[Step S255] The anchor text extracted in step S252 is registered in the anchor text 1132c of the anchor text-link destination URL table 1132. The link destination URL (absolute path) is also registered in the link destination URL 1132d in association with the anchor text. The domain name extracted in the previous stage is registered in the domain 1132a, and the URL of the processing target page is registered in the processing target URL 1132b. The anchor text-link destination URL table 1132 is stored in the link feature DB 113.

［ステップＳ２５６］処理対象のページの最後のリンク情報であるかどうかを判定する。最後のリンク情報でないときは、ステップＳ２５１に戻って次のリンク情報の処理を行う。最後のリンク情報であれば、リンク情報Ｂ抽出処理を終了する。 [Step S256] It is determined whether this is the last link information of the page to be processed. If it is not, the process returns to step S251 to process the next link information. If it is, the link information B extraction process ends.

以上の処理手順が実行されることにより、処理対象のページのリンク情報に記述されるアンカーテキストと、リンク先ＵＲＬとが、アンカーテキスト−リンク先ＵＲＬテーブル１１３２に登録される。 By executing the above processing procedure, the anchor text described in the link information of the processing target page and the link destination URL are registered in the anchor text-link destination URL table 1132.
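As an illustration of steps S251 to S255, the sketch below walks the anchor elements of one page and yields (anchor text, absolute link destination URL) pairs. It relies only on the standard library. `AnchorParser` and `extract_links` are hypothetical names, and anchors without an `href` attribute are simply skipped, which is an assumption rather than something the text states.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin


class AnchorParser(HTMLParser):
    """Collects (anchor text, absolute URL) for every <a href=...> element."""

    def __init__(self, page_url: str) -> None:
        super().__init__()
        self.page_url = page_url
        self.links: List[Tuple[str, str]] = []
        self._href = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:             # S252: link destination URL found
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:           # S252: text inside the anchor
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor_text = "".join(self._text).strip()
            # S253-S254: urljoin leaves absolute URLs unchanged and resolves relative ones
            absolute = urljoin(self.page_url, self._href)
            self.links.append((anchor_text, absolute))   # S255: one table row
            self._href = None


def extract_links(page_url: str, html: str) -> List[Tuple[str, str]]:
    parser = AnchorParser(page_url)
    parser.feed(html)
    parser.close()
    return parser.links
```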

図１８は、リンク特徴集計処理の手順を示したフローチャートである。
ＵＲＬ−タイトルテーブル１１３１及びアンカーテキスト−リンク先ＵＲＬテーブル１１３２を用いて、特徴ごとのデータ集計を行う。 FIG. 18 is a flowchart showing the procedure of the link feature totaling process.
Using the URL-title table 1131 and the anchor text-link destination URL table 1132, data for each feature is aggregated.

［ステップＳ３１］リンク特徴ルール（記憶装置）１１４からリンク特徴ルールを読み出す。リンク特徴ルールに基づいて、特徴１の分析を行うか否かを判定する。分析を行うときは、処理をステップＳ３２に進める。分析を行わないときは、処理をステップＳ３３に進める。 [Step S31] A link feature rule is read from the link feature rule (storage device) 114. It is determined whether or not to analyze feature 1 based on the link feature rule. When analyzing, the process proceeds to step S32. When the analysis is not performed, the process proceeds to step S33.

［ステップＳ３２］特徴１の分析を行うと判定されたときは、特徴１に応じた分析を行い、特徴を満たすリンクの数を集計する。特徴１分析処理の詳細は後述する。
［ステップＳ３３］リンク特徴ルールに基づいて、特徴２の分析を行うか否かを判定する。分析を行うときは、処理をステップＳ３４に進める。分析を行わないときは、処理をステップＳ３５に進める。 [Step S32] When it is determined that the feature 1 is to be analyzed, the analysis according to the feature 1 is performed, and the number of links satisfying the feature is tabulated. Details of the feature 1 analysis process will be described later.
[Step S33] Based on the link feature rule, it is determined whether or not to analyze feature 2. When analyzing, the process proceeds to step S34. When the analysis is not performed, the process proceeds to step S35.

［ステップＳ３４］特徴２の分析を行うと判定されたときは、特徴２に応じた分析を行い、特徴を満たすリンクの数を集計する。特徴２分析処理の詳細は後述する。
［ステップＳ３５］リンク特徴ルールに基づいて、特徴３の分析を行うか否かを判定する。分析を行うときは、処理をステップＳ３６に進める。分析を行わないときは、リンク特徴集計処理を終了する。 [Step S34] When it is determined that the analysis of the feature 2 is performed, the analysis according to the feature 2 is performed, and the number of links satisfying the feature is totaled. Details of the feature 2 analysis processing will be described later.
[Step S35] Based on the link feature rule, it is determined whether to analyze feature 3. When analyzing, the process proceeds to step S36. When the analysis is not performed, the link feature totaling process is terminated.

［ステップＳ３６］特徴３の分析を行うと判定されたときは、特徴３に応じた分析を行い、特徴を満たすリンクの数を集計する。特徴３分析処理の詳細は後述する。特徴３分析処理の終了後、リンク特徴集計処理を終了する。 [Step S36] When it is determined that the analysis of the feature 3 is performed, the analysis according to the feature 3 is performed and the number of links satisfying the feature is totaled. Details of the feature 3 analysis process will be described later. After the feature 3 analysis process is completed, the link feature aggregation process is terminated.

以上の処理手順が実行されることにより、特徴１、特徴２、特徴３のうち、任意の特徴を用いて辞書サイトの判定を行うことができる。
図１９は、特徴１分析処理の手順を示したフローチャートである。 By executing the above processing procedure, the dictionary site can be determined using any one of the features 1, 2, and 3.
FIG. 19 is a flowchart showing the procedure of the feature 1 analysis process.

特徴１分析処理では、特徴１に基づき、リンク元のアンカーテキストとリンク先ページのタイトルタグ内の文字列（タイトル）とが一致するリンクの数を特徴１カウンタとして集計する。 In the feature 1 analysis processing, based on feature 1, the number of links where the link source anchor text matches the character string (title) in the title tag of the linked page is counted as a feature 1 counter.

［ステップＳ３２１］アンカーテキスト−リンク先ＵＲＬテーブル１１３２から１行読み出す。ドメイン、処理対象ＵＲＬ、アンカーテキスト及びリンク先ＵＲＬが読み出される。 [Step S321] One line is read from the anchor text-link destination URL table 1132. The domain, processing target URL, anchor text, and link destination URL are read out.

［ステップＳ３２２］ステップＳ３２１で読み出したドメインに該当する特徴１カウンタテーブル１１３３の全リンクカウンタ１１３３ｂを１増やして格納する。特徴１カウンタテーブル１１３３のドメイン１１３３ａに該当するドメインが設定されていなかったときは、新たにレコードを作成し、対応する全リンクカウンタ１１３３ｂに１を設定する。 [Step S322] The all-link counter 1133b of the feature 1 counter table 1133 for the domain read in step S321 is incremented by 1 and stored. If no matching domain is set in the domain 1133a of the feature 1 counter table 1133, a new record is created and the corresponding all-link counter 1133b is set to 1.

［ステップＳ３２３］ステップＳ３２１で読み出したリンク先ＵＲＬは、自サイト内のリンクであるかどうかを判定する。自サイト内のリンクであれば、処理をステップＳ３２４に進める。自サイト内のリンクでなければ、処理をステップＳ３２９に進める。 [Step S323] It is determined whether the link destination URL read in step S321 is a link in the own site. If so, the process advances to step S324. If the link is not within the local site, the process proceeds to step S329.

［ステップＳ３２４］ステップＳ３２１で読み出したドメインに該当する特徴１カウンタテーブル１１３３の内部リンクカウンタ１１３３ｃを１増やして格納する。
［ステップＳ３２５］ＵＲＬ−タイトルテーブル１１３１のＵＲＬ１１３１ｂを検索し、ステップＳ３２１で読み出したリンク先ＵＲＬと同じＵＲＬが登録される行を検出する。そして、検出されたＵＲＬに対応するタイトル１１３１ｃからタイトルを取り出す。 [Step S324] The internal link counter 1133c of the feature 1 counter table 1133 corresponding to the domain read in step S321 is incremented by 1 and stored.
[Step S325] The URL 1131b of the URL-title table 1131 is searched, and a line in which the same URL as the link destination URL read in step S321 is registered is detected. Then, a title is extracted from the title 1131c corresponding to the detected URL.

［ステップＳ３２６］ステップＳ３２１で読み出したアンカーテキストが、ステップＳ３２５で取り出したタイトルの中に含まれているかどうかを判定する。アンカーテキストがタイトルに含まれているときは、処理をステップＳ３２７に進める。含まれていないときは、処理をステップＳ３２９に進める。 [Step S326] It is determined whether the anchor text read in step S321 is included in the title extracted in step S325. If the anchor text is included in the title, the process proceeds to step S327. If not included, the process proceeds to step S329.

［ステップＳ３２７］アンカーテキストがタイトルに含まれているときは、特徴１カウンタテーブル１１３３の該当するドメインの行の特徴１カウンタ１１３３ｄを１増やして格納する。 [Step S327] When the anchor text is included in the title, the feature 1 counter 1133d of the corresponding domain row in the feature 1 counter table 1133 is incremented by 1 and stored.

［ステップＳ３２８］特徴１エントリテーブル１１３４にアンカーテキストがタイトルに含まれているリンク情報をエントリする。処理対象のページのＵＲＬは、処理対象ＵＲＬ（ドメイン）１１３４ａ、アンカーテキストは単語１１３４ｂ、そしてリンク先ＵＲＬはＵＲＬ１１３４ｃに登録する。 [Step S328] The link information in which the anchor text is included in the title is entered in the feature 1 entry table 1134. The URL of the processing target page is registered in the processing target URL (domain) 1134a, the anchor text is registered in the word 1134b, and the link destination URL is registered in the URL 1134c.

［ステップＳ３２９］アンカーテキスト−リンク先ＵＲＬテーブル１１３２の処理対象ＵＲＬ１１３２ｂに、未処理のＵＲＬが残っているかどうかを判定する。残っていれば、処理をステップＳ３２１に戻し、次の処理対象ＵＲＬについて処理を行う。残っていなければ、特徴１分析処理を終了する。 [Step S329] It is determined whether or not an unprocessed URL remains in the processing target URL 1132b of the anchor text-link destination URL table 1132. If it remains, the process returns to step S321, and the next process target URL is processed. If not, the feature 1 analysis process is terminated.

以上の処理手順が実行されることにより、全リンクカウンタ、内部リンクカウンタ及び特徴１を満たしたリンクの数を集計した特徴１カウンタが得られる。集計結果は、処理対象のサイトのドメインに対応付けて、特徴１カウンタテーブル１１３３に登録される。また、このとき同時に、特徴１の要件を満たすアンカーテキストとリンク先ＵＲＬとを対応付けた特徴１エントリテーブル１１３４も生成される。 By executing the above processing procedure, a total link counter, an internal link counter, and a feature 1 counter in which the number of links satisfying the feature 1 are totaled are obtained. The aggregation result is registered in the feature 1 counter table 1133 in association with the domain of the processing target site. At the same time, the feature 1 entry table 1134 in which the anchor text satisfying the requirement of the feature 1 is associated with the link destination URL is also generated.
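The counting in steps S321 to S328 can be sketched as below. The function name `analyze_feature1`, the tuple layout of `link_rows`, and the dictionary keys are hypothetical. Two choices in the sketch are assumptions not taken from the text: a link is treated as internal when the host name of its destination equals the site's domain, and an empty anchor text is never counted as a match.

```python
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlsplit

LinkRow = Tuple[str, str, str]  # (processing target URL, anchor text, link destination URL)


def analyze_feature1(domain: str, link_rows: Iterable[LinkRow],
                     titles: Dict[str, str]) -> Tuple[Dict[str, int], List[LinkRow]]:
    """Count links whose anchor text is contained in the title of the linked page."""
    counters = {"all_links": 0, "internal_links": 0, "feature1": 0}
    entries: List[LinkRow] = []
    for page_url, anchor, target in link_rows:       # S321: read one row
        counters["all_links"] += 1                    # S322
        if urlsplit(target).hostname != domain:       # S323: assumed internal-link test
            continue
        counters["internal_links"] += 1               # S324
        title = titles.get(target, "")                # S325: look up the title by URL
        if anchor and anchor in title:                # S326: anchor text within title
            counters["feature1"] += 1                 # S327
            entries.append((page_url, anchor, target))  # S328: entry table row
    return counters, entries
```

The three counters correspond to the all-link, internal-link, and feature 1 columns of the counter table, and `entries` plays the role of the feature 1 entry table.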

図２０は、特徴２分析処理の手順を示したフローチャートである。
特徴２分析処理では、特徴２に基づき、リンク元のアンカーテキストとリンク先ページのファイル名とが一致するリンクの数を特徴２カウンタとして集計する。 FIG. 20 is a flowchart showing the procedure of the feature 2 analysis process.
In the feature 2 analysis process, based on the feature 2, the number of links whose link source anchor text matches the file name of the link destination page is counted as a feature 2 counter.

［ステップＳ３４１］アンカーテキスト−リンク先ＵＲＬテーブル１１３２から１行読み出す。ドメイン、処理対象ＵＲＬ、アンカーテキスト及びリンク先ＵＲＬが読み出される。 [Step S341] One line is read from the anchor text-link destination URL table 1132. The domain, processing target URL, anchor text, and link destination URL are read out.

［ステップＳ３４２］ステップＳ３４１で読み出したドメインに該当する特徴２カウンタテーブル１１３５の全リンクカウンタ１１３５ｂを１増やして格納する。特徴２カウンタテーブル１１３５のドメイン１１３５ａに該当するドメインが設定されていなかったときは、新たにレコードを作成し、対応する全リンクカウンタ１１３５ｂに１を設定する。 [Step S342] The all-link counter 1135b of the feature 2 counter table 1135 for the domain read in step S341 is incremented by 1 and stored. If no matching domain is set in the domain 1135a of the feature 2 counter table 1135, a new record is created and the corresponding all-link counter 1135b is set to 1.

［ステップＳ３４３］ステップＳ３４１で読み出したリンク先ＵＲＬは、自サイト内のリンクであるかどうかを判定する。自サイト内のリンクであれば、処理をステップＳ３４４に進める。自サイト内のリンクでなければ、処理をステップＳ３４９に進める。 [Step S343] It is determined whether the link destination URL read in step S341 is a link within the same site. If it is, the process proceeds to step S344. If it is not, the process proceeds to step S349.

［ステップＳ３４４］ステップＳ３４１で読み出したドメインに該当する特徴２カウンタテーブル１１３５の内部リンクカウンタ１１３５ｃを１増やして格納する。
［ステップＳ３４５］ステップＳ３４１で読み出したアンカーテキストをＵＲＬエンコードする。一般に、ＵＲＬとして使用できない記号や全角文字などは、ＵＲＬに組み込む際にＵＲＬエンコード処理され、「％Ｅ３」などの半角文字の組み合わせに変換される。このため、リンク先ＵＲＬに含まれるファイル名とアンカーテキストとを照合する際には、アンカーテキストをＵＲＬエンコード処理しておく必要がある。 [Step S344] The internal link counter 1135c of the feature 2 counter table 1135 corresponding to the domain read in step S341 is incremented by 1 and stored.
[Step S345] The anchor text read in step S341 is URL-encoded. In general, symbols, double-byte characters, and the like that cannot be used as URLs are URL-encoded when they are incorporated into URLs, and converted to combinations of single-byte characters such as “% E3”. For this reason, when the file name included in the link destination URL is compared with the anchor text, the anchor text needs to be URL-encoded.

［ステップＳ３４６］ステップＳ３４５でＵＲＬエンコード処理されたアンカーテキストと、ステップＳ３４１で読み出したリンク先ＵＲＬに含まれるファイル名とを照合する。アンカーテキストとファイル名が一致するときは、処理をステップＳ３４７に進める。一致しないときは、処理をステップＳ３４９に進める。 [Step S346] The anchor text subjected to the URL encoding process in step S345 is collated with the file name included in the link destination URL read in step S341. If the anchor text matches the file name, the process proceeds to step S347. If not, the process proceeds to step S349.

［ステップＳ３４７］アンカーテキストとファイル名とが一致するときは、特徴２カウンタテーブル１１３５の該当するドメインの行の特徴２カウンタ１１３５ｄを１増やして格納する。 [Step S347] When the anchor text matches the file name, the feature 2 counter 1135d of the corresponding domain row in the feature 2 counter table 1135 is incremented by 1 and stored.

［ステップＳ３４８］特徴２エントリテーブル１１３６にアンカーテキストがファイル名と一致するリンク情報をエントリする。処理対象のページのＵＲＬは、処理対象ＵＲＬ（ドメイン）１１３６ａ、アンカーテキストは単語１１３６ｂ、そしてリンク先ＵＲＬはＵＲＬ１１３６ｃに登録する。 [Step S348] The link information whose anchor text matches the file name is entered in the feature 2 entry table 1136. The URL of the processing target page is registered in the processing target URL (domain) 1136a, the anchor text in the word 1136b, and the link destination URL in the URL 1136c.

［ステップＳ３４９］アンカーテキスト−リンク先ＵＲＬテーブル１１３２の処理対象ＵＲＬ１１３２ｂに、未処理のＵＲＬが残っているかどうかを判定する。残っていれば、処理をステップＳ３４１に戻し、次の処理対象ＵＲＬについて処理を行う。残っていなければ、特徴２分析処理を終了する。 [Step S349] It is determined whether or not an unprocessed URL remains in the processing target URL 1132b of the anchor text-link destination URL table 1132. If it remains, the process returns to step S341, and the next URL to be processed is processed. If not, the feature 2 analysis process is terminated.

以上の処理手順が実行されることにより、全リンクカウンタ、内部リンクカウンタ及び特徴２を満たしたリンクの数を集計した特徴２カウンタが得られる。集計結果は、処理対象のサイトのドメインに対応付けて、特徴２カウンタテーブル１１３５に登録される。なお、特徴１の分析処理を同時に行う場合には、いずれか一方で全リンクカウンタ及び内部リンクカウンタを集計する処理を行えばよい。また、このとき同時に、特徴２の要件を満たすアンカーテキストとリンク先ＵＲＬとを対応付けた特徴２エントリテーブル１１３６も生成される。 By executing the above processing procedure, a feature 2 counter in which the total link counter, the internal link counter, and the number of links satisfying the feature 2 are totaled is obtained. The aggregation result is registered in the feature 2 counter table 1135 in association with the domain of the processing target site. Note that when the analysis process of feature 1 is performed at the same time, the process of counting all link counters and internal link counters may be performed on either side. At the same time, a feature 2 entry table 1136 in which the anchor text satisfying the requirement of feature 2 and the link destination URL are associated is also generated.
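The comparison in steps S345 and S346 can be illustrated with `urllib.parse.quote`, which percent-encodes the anchor text before it is compared with the file name in the link destination URL. The sketch makes several assumptions that the text does not state: the site encodes its URLs as UTF-8 (a site using Shift_JIS or EUC-JP would need that encoding passed to `quote`), the file extension is ignored, and the comparison is case-insensitive so that `%e3` and `%E3` are treated alike. The function name `anchor_matches_file_name` is hypothetical.

```python
import posixpath
from urllib.parse import quote, urlsplit


def anchor_matches_file_name(anchor: str, target_url: str) -> bool:
    """Return True when the URL-encoded anchor text equals the file name of the URL."""
    if not anchor:
        return False
    path = urlsplit(target_url).path                 # path is left percent-encoded
    file_name = posixpath.basename(path)             # last path segment
    stem, _extension = posixpath.splitext(file_name)  # assumption: ignore ".html" etc.
    encoded = quote(anchor, safe="")                 # S345: URL-encode the anchor text
    return encoded.lower() == stem.lower()           # S346: collate with the file name
```

For example, the anchor text 辞書 is encoded as `%E8%BE%9E%E6%9B%B8`, so it matches a link destination whose last path segment is that string.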

図２１は、特徴３分析処理の手順を示したフローチャートである。
特徴３分析処理では、特徴３に基づき、リンク先ページのタイトルからアンカーテキストを除いた文字列を共通タイトルとして抽出し、共通タイトルが出現する数を集計する。 FIG. 21 is a flowchart showing the procedure of the feature 3 analysis process.
In the feature 3 analysis process, a character string obtained by removing the anchor text from the title of the link destination page is extracted as a common title based on the feature 3, and the number of appearances of the common title is totaled.

［ステップＳ３６１］アンカーテキスト−リンク先ＵＲＬテーブル１１３２から１行読み出す。ドメイン、処理対象ＵＲＬ、アンカーテキスト及びリンク先ＵＲＬが読み出される。 [Step S361] One line is read from the anchor text-link destination URL table 1132. The domain, processing target URL, anchor text, and link destination URL are read out.

［ステップＳ３６２］ステップＳ３６１で読み出したドメインに該当する特徴３カウンタテーブル１１３７の全リンクカウンタ１１３７ｂを１増やして格納する。特徴３カウンタテーブル１１３７のドメイン１１３７ａに該当するドメインが設定されていなかったときは、新たにレコードを作成し、対応する全リンクカウンタ１１３７ｂに１を設定する。 [Step S362] The all-link counter 1137b of the feature 3 counter table 1137 for the domain read in step S361 is incremented by 1 and stored. If no matching domain is set in the domain 1137a of the feature 3 counter table 1137, a new record is created and the corresponding all-link counter 1137b is set to 1.

［ステップＳ３６３］ステップＳ３６１で読み出したリンク先ＵＲＬは、自サイト内のリンクであるかどうかを判定する。自サイト内のリンクであれば、処理をステップＳ３６４に進める。自サイト内のリンクでなければ、処理をステップＳ３６９に進める。 [Step S363] It is determined whether the link destination URL read in step S361 is a link in the own site. If so, the process advances to step S364. If it is not a link in its own site, the process proceeds to step S369.

［ステップＳ３６４］ステップＳ３６１で読み出したドメインに該当する特徴３カウンタテーブル１１３７の内部リンクカウンタ１１３７ｃを１増やして格納する。
［ステップＳ３６５］ＵＲＬ−タイトルテーブル１１３１のＵＲＬ１１３１ｂを検索し、ステップＳ３６１で読み出したリンク先ＵＲＬと同じＵＲＬが登録される行を検出する。そして、検出されたＵＲＬに対応するタイトルをタイトル１１３１ｃから取り出す。 [Step S364] The internal link counter 1137c of the feature 3 counter table 1137 corresponding to the domain read in step S361 is incremented by 1 and stored.
[Step S365] The URL 1131b of the URL-title table 1131 is searched, and a line in which the same URL as the link destination URL read in step S361 is registered is detected. Then, the title corresponding to the detected URL is extracted from the title 1131c.

［ステップＳ３６６］ステップＳ３６５で取り出したタイトルに、ステップＳ３６１で読み出したアンカーテキストが含まれているかどうかを判定する。含まれているときは、処理をステップＳ３６７に進める。含まれていないときは、処理をステップＳ３６９に進める。 [Step S366] It is determined whether or not the anchor text read out in step S361 is included in the title extracted in step S365. If it is included, the process proceeds to step S367. If not included, the process proceeds to step S369.

［ステップＳ３６７］タイトルにアンカーテキストが含まれていたときは、タイトルからアンカーテキストを除いた文字列を抽出し、共通タイトルとする。抽出された共通タイトルと、共通タイトルテーブル１１３８の共通タイトル１１３８ｂとを照合し、一致するものがあれば、対応するカウンタ１１３８ｃを１増やして格納する。一致するものがなければ、共通タイトル１１３８ｂに新たにレコードとして登録し、対応するカウンタ１１３８ｃに１を設定する。 [Step S367] When the anchor text is included in the title, a character string excluding the anchor text from the title is extracted and set as a common title. The extracted common title and the common title 1138b of the common title table 1138 are collated, and if there is a match, the corresponding counter 1138c is incremented by 1 and stored. If there is no match, a new record is registered in the common title 1138b, and 1 is set in the corresponding counter 1138c.

［ステップＳ３６８］特徴３エントリテーブル１１３９にステップＳ３６７で共通タイトルを登録したリンク情報をエントリする。処理対象のページのＵＲＬは、処理対象ＵＲＬ（ドメイン）１１３９ａ、アンカーテキストは単語１１３９ｂ、リンク先ＵＲＬはＵＲＬ１１３９ｃ、そして共通タイトルは共通タイトル１１３９ｄに登録する。 [Step S368] The link information in which the common title is registered in Step S367 is entered in the feature 3 entry table 1139. The URL of the processing target page is registered in the processing target URL (domain) 1139a, the anchor text is registered in the word 1139b, the link destination URL is registered in the URL 1139c, and the common title is registered in the common title 1139d.

[Step S369] It is determined whether any unprocessed URL remains in the processing target URL 1132b of the anchor text-link destination URL table 1132. If one remains, the process returns to step S361 and the next processing target URL is processed. If none remains, the process proceeds to step S370.

[Step S370] The count values of the counter 1138c registered in the common title table 1138 by the processing up to step S369 are compared, and the maximum count value and its common title are extracted. The maximum count value is registered in the feature 3 counter 1137d of the feature 3 counter table 1137, and the common title corresponding to the maximum count value is registered in the common title 1137e.

[Step S371] The common title registered in step S370 is collated in turn with the common title 1139d of each record in the feature 3 entry table 1139, and every record whose common title does not match is deleted. As a result, only the entries whose common title is the one shared by the largest number of pages remain in the feature 3 entry table 1139.

By executing the above processing procedure, the all-link counter, the internal link counter, and the feature 3 counter, which tallies the number of links having the common title that satisfies feature 3, are obtained. The aggregation results are registered in the feature 3 counter table 1137 in association with the domain of the site being processed. When the analysis process for feature 1 or feature 2 is performed at the same time, the all-link counter and the internal link counter need to be tallied in only one of the analysis processes. At the same time, the feature 3 entry table 1139, which associates each anchor text satisfying the requirement of feature 3 with its link destination URL, is also generated.
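The feature 3 analysis described in steps S364 to S371 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `analyze_feature3`, the tuple layout of `links`, the `titles` dictionary, the internal-link test (comparing the host part of the link destination URL with the site's domain), and the trimming of whitespace from the common title are all hypothetical choices made for the example.

```python
from collections import Counter
from urllib.parse import urlparse


def analyze_feature3(links, titles, site_domain):
    """Tally feature 3 for one site (illustrative sketch).

    links:  iterable of (source_url, anchor_text, target_url) tuples,
            standing in for the anchor text-link destination URL table 1132.
    titles: dict mapping URL -> page title, standing in for table 1131.
    """
    all_links = 0
    internal_links = 0
    common_title_counts = Counter()   # plays the role of table 1138
    entries = []                      # plays the role of table 1139

    for source_url, anchor, target_url in links:
        all_links += 1
        # Assumption: a link is internal when its host equals the site domain.
        if urlparse(target_url).netloc != site_domain:
            continue
        internal_links += 1           # S364
        title = titles.get(target_url)                      # S365
        if not title or not anchor or anchor not in title:  # S366
            continue
        # S367: what remains after removing the anchor text is the common title.
        common = title.replace(anchor, "", 1).strip()
        common_title_counts[common] += 1
        entries.append((source_url, anchor, target_url, common))  # S368

    if not common_title_counts:
        return all_links, internal_links, 0, None, []
    # S370: keep the most frequent common title and its count.
    best_title, best_count = common_title_counts.most_common(1)[0]
    # S371: drop entries whose common title differs from the most frequent one.
    kept = [e for e in entries if e[3] == best_title]
    return all_links, internal_links, best_count, best_title, kept
```

With a site whose explanation pages are titled in the pattern "term - IT Glossary", the shared suffix becomes the common title, and only the links to those pages survive as entries.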

When the link feature aggregation process ends in this way, the dictionary site determination unit 150 determines whether the detected site is a dictionary site.
FIG. 22 is a flowchart showing the procedure of the dictionary site determination process.

When the dictionary site determination process starts, it is assumed that the domain name being processed has been carried over from the preceding process.
[Step S41] The dictionary site determination unit 150 reads the dictionary site determination rule from the dictionary site determination rule (storage device) 115.

[Step S42] The dictionary site determination unit 150 calculates a dictionary-likeness score based on the dictionary site determination rule read in step S41. For this calculation, it uses the aggregation results for the site registered in the applicable feature 1 counter table 1133, feature 2 counter table 1135, or feature 3 counter table 1137.

[Step S43] The dictionary-likeness score calculated in step S42 is compared with the threshold defined in the dictionary site determination rule. If the dictionary site determination rule is satisfied, the site can be determined to be a dictionary site candidate.

[Step S44] If the comparison in step S43 determines that the site is a dictionary site candidate, the process proceeds to step S45. Otherwise, the dictionary site determination process ends.

[Step S45] A site determined to be a dictionary site candidate is treated as a dictionary candidate and registered in the dictionary candidate table 1171. The processing target URL (domain) is registered in the site URL (domain) 1171a of the dictionary candidate table 1171, and the calculated dictionary-likeness score is registered in the score 1171b. After registration, the dictionary site determination process ends.
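Steps S42 to S45 can be illustrated with the short sketch below. It assumes that the dictionary-likeness score is the share of internal links that satisfy the selected feature, as described in the supplementary notes of this document. The rule format (a dictionary holding a single `threshold`), the use of "greater than or equal to" for the comparison, and the names `dictionary_likeness` and `judge_site` are hypothetical.

```python
def dictionary_likeness(feature_count, internal_links):
    """Share of internal links that satisfy the feature (0.0 if there are none)."""
    if internal_links == 0:
        return 0.0
    return feature_count / internal_links


def judge_site(domain, counters, rule, candidates):
    """Score one site and register it as a candidate if it passes the rule.

    counters:   {"feature_count": int, "internal_links": int}
    rule:       {"threshold": float}  (hypothetical rule format)
    candidates: dict acting as the dictionary candidate table 1171
    """
    # S42: calculate the dictionary-likeness score from the counters.
    score = dictionary_likeness(counters["feature_count"],
                                counters["internal_links"])
    # S43/S44: compare with the threshold (">=" is an assumption).
    if score >= rule["threshold"]:
        candidates[domain] = score    # S45: register domain and score
        return True
    return False
```

Guarding against a site with no internal links avoids a division by zero. Such a site simply receives a score of 0.0 and is never registered.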

Once a dictionary site candidate has been determined, the dictionary entry candidate creation unit 160 creates dictionary entry candidates for that dictionary site candidate.
FIG. 23 is a flowchart showing the procedure of the dictionary entry candidate creation process.

[Step S61] Entries are read from the entry tables stored in the link feature DB 113 (the feature 1 entry table 1134, the feature 2 entry table 1136, and the feature 3 entry table 1139). One entry, consisting of a pair of a word 1134b, 1136b, or 1139b and a URL 1134c, 1136c, or 1139c for the applicable processing target URL (domain), is taken out.

[Step S62] It is determined whether the entry taken out in step S61 is already registered under the applicable processing target URL in the entry candidate table 1172 of the dictionary candidate DB 117. If it is not registered, the process proceeds to step S63. If it is already registered, the process proceeds to step S64.

[Step S63] The entry, which is not yet registered in the entry candidate table 1172, is registered in the entry candidate table 1172 as a new record.
[Step S64] It is determined whether any entries remain. If there is a next entry, the process returns to step S61 and is repeated from the extraction of an entry. If there is no next entry, the dictionary entry candidate creation process ends.

The above processing procedure creates the entry candidate table for the dictionary site candidate. Rules for registering entry candidates may be determined in advance and stored in the dictionary addition rule (storage device) 116. For example, rules such as excluded keywords, URL specifications, or registering only words detected on a plurality of sites may be set. In that case, the dictionary entry candidate creation process registers in the entry candidate table 1172 only the entries that satisfy the conditions defined in the rules.
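The loop of steps S61 to S64, together with optional registration rules, might look like the sketch below. The function name `build_entry_candidates`, the `(domain, word, url)` tuple layout, and the rule parameters `exclude_words` and `min_sites` are hypothetical. They only illustrate two of the rule types mentioned above (excluded keywords, and words detected on a plurality of sites).

```python
def build_entry_candidates(entry_tables, existing=None,
                           exclude_words=(), min_sites=1):
    """Merge (domain, word, url) entries from the feature entry tables.

    entry_tables: list of tables, each a list of (domain, word, url) tuples.
    existing:     entries already in the entry candidate table, if any.
    exclude_words, min_sites: hypothetical registration-rule parameters.
    """
    candidates = list(existing or [])
    seen = set(candidates)

    # Count on how many distinct sites each word was detected.
    sites_per_word = {}
    for table in entry_tables:
        for domain, word, _url in table:
            sites_per_word.setdefault(word, set()).add(domain)

    for table in entry_tables:
        for domain, word, url in table:       # S61: take out one entry
            entry = (domain, word, url)
            if entry in seen:                  # S62: already registered
                continue
            if word in exclude_words:          # rule: excluded keyword
                continue
            if len(sites_per_word[word]) < min_sites:  # rule: site count
                continue
            candidates.append(entry)           # S63: register a new record
            seen.add(entry)
    return candidates                          # S64: no entries remain
```

Keeping a set of already registered entries alongside the list makes the duplicate check in step S62 a constant-time lookup while preserving registration order.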

When a dictionary site candidate has been determined and its dictionary entry candidates have been created in this way, the user presenting unit 170 presents the dictionary site candidate and the dictionary entry candidates to the user.
FIG. 24 is a flowchart showing the procedure of the user presentation process.

[Step S81] The information on the dictionary site candidates set in the dictionary candidate table 1171 and the entry candidate table 1172, which are stored in the dictionary candidate DB 117, is output to the client device 200. Based on the acquired information, the client device 200 displays the dictionary site candidate list screen 2100 shown in FIG. 12.

[Step S82] The process waits until an instruction from the user is received via the client device 200. When an instruction is received, the process proceeds to step S83.
[Step S83] It is determined whether the instruction received in step S82 is a registration request. If it is, the process proceeds to step S84. If it is not, the process proceeds to step S86.

[Step S84] In the case of a registration request, the dictionary site candidate is registered as a dictionary and stored in the dictionary DB 118.
[Step S85] An acceptance confirmation screen is output to the client device 200, and the user presentation process ends.

[Step S86] If the instruction is not a registration request, the requested process is executed, and the user presentation process ends.
By executing the above processing procedure, dictionary sites of the relevant genre are detected automatically based on words that the user enters as examples, and at the same time dictionary entries (pairs of a term and the URL of its explanation page) are extracted. This greatly reduces the user's dictionary creation work. Running the procedure periodically also makes dictionary maintenance easier.
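The branching in steps S83 to S86 can be sketched as a small dispatch function. Everything here is illustrative: the instruction string `"register"`, the shape of `candidates`, the return values, and the `other_handlers` mapping are hypothetical stand-ins for the messages exchanged with the client device 200.

```python
def handle_user_instruction(instruction, site, candidates, dictionary_db,
                            other_handlers=None):
    """Process one instruction received for a presented candidate site.

    candidates:    {site: {"score": float, "entries": [(word, url), ...]}}
    dictionary_db: dict acting as the dictionary DB 118
    """
    if instruction == "register":                 # S83: registration request?
        # S84: register the candidate and store its entries.
        dictionary_db[site] = candidates[site]["entries"]
        return "accepted"                         # S85: confirmation
    # S86: any other request is passed to its own handler, if one exists.
    handler = (other_handlers or {}).get(instruction)
    return handler(site) if handler else "ignored"
```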

The above processing functions can be realized by a computer. In that case, a program describing the processing contents of the functions that the document group detection apparatus should have is provided. The processing functions are realized on the computer by executing the program on the computer. The program describing the processing contents can be recorded on a computer-readable recording medium.

When the program is distributed, portable recording media on which the program is recorded, such as a DVD (Digital Versatile Disc) or a CD-ROM (Compact Disc Read Only Memory), are sold, for example. The program can also be stored in a storage device of a server computer and transferred from the server computer to other computers via a network.

The computer that executes the program stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. The computer then reads the program from its own storage device and executes processing according to the program. The computer can also read the program directly from the portable recording medium and execute processing according to it. Furthermore, each time a program is transferred from the server computer, the computer can execute processing according to the received program.

Regarding the above embodiment, the following supplementary notes are further disclosed.
(Supplementary Note 1) A document group detection method for detecting a predetermined document group that is a set of documents provided on a network and managed by one or more computers, wherein the computer executes:
a collection procedure of, targeting document groups that form a hierarchical structure in which a plurality of subordinate documents exist under a specific document, searching for any of the subordinate documents using a specific keyword, detecting the specific document based on the subordinate document found, and collecting the plurality of subordinate documents under the specific document;
a feature counting procedure of, for each of the specific document and the plurality of subordinate documents of the collected document group, extracting connection information that is attached to an arbitrary character string in the document and indicates an association with a specific other document, and counting the number of states in which the subordinate document and the associated connection destination document are in a specific relationship;
a document group determination procedure of reading a feature rule, which is a condition using the specific relationship, from feature rule storage means in which the feature rule is stored, determining whether the number of states of the specific relationship for the document group satisfies the condition of the feature rule, and registering a document group that satisfies the condition as a target document group candidate; and
a procedure of outputting the document group registered as the target document group candidate.

(Supplementary Note 2) The document group detection method according to Supplementary Note 1, wherein, when a keyword characterizing the document group to be detected or a document belonging to that document group is acquired, the collection procedure generates an extended keyword by appending a preset extension word to the keyword, and performs the search using the extended keyword in addition to the keyword.

(Supplementary Note 3) The document group detection method according to Supplementary Note 1, wherein:
the document group to be detected is a dictionary document group, that is, a document group concerning explanatory information on a plurality of terms in an arbitrary field, in which the specific document describes a list of the terms explained and the plurality of subordinate documents are glossary documents each describing the explanation of a term;
the connection information is attached to a character string representing a term appearing in the specific document and the glossary documents, and associates that character string with the glossary document corresponding to it; and
the specific relationship in the feature counting procedure is a specific relationship between the character string associated by the connection information and the connection destination document.

(Supplementary Note 4) The document group detection method according to Supplementary Note 3, wherein:
in the feature counting procedure, the character string to which the connection information is attached and the connection destination document corresponding to that character string are analyzed, and the following are counted: the number of pieces of internal connection information, which is connection information that associates a glossary document belonging to the same document group; and the number of pieces of specific internal connection information for which, as the state of the specific relationship, the character string to which the connection information is attached matches a character string contained in the title of the connection destination document; and
in the document group determination procedure, the ratio of the specific internal connection information to all of the internal connection information is calculated as a dictionary-likeness score, and the dictionary-likeness score is compared with a threshold based on the feature rule to determine whether the document group satisfies the conditions of the dictionary document group.

(Supplementary Note 5) The document group detection method according to Supplementary Note 3, wherein:
in the feature counting procedure, the character string to which the connection information is attached and the connection destination document corresponding to that character string are analyzed, and the following are counted: the number of pieces of internal connection information, which is connection information that associates a glossary document belonging to the same document group; and the number of pieces of specific internal connection information for which, as the state of the specific relationship, the character string to which the connection information is attached matches a character string contained in the file name of the connection destination document; and
in the document group determination procedure, the ratio of the specific internal connection information to all of the internal connection information is calculated as a dictionary-likeness score, and the dictionary-likeness score is compared with a threshold based on the feature rule to determine whether the document group satisfies the conditions of the dictionary document group.

(Supplementary Note 6) The document group detection method according to Supplementary Note 3, wherein:
in the feature counting procedure, the character string to which the connection information is attached and the connection destination document corresponding to that character string are analyzed, and the following are counted: the number of pieces of internal connection information, which is connection information that associates a glossary document belonging to the same document group; and the number of pieces of specific connection information whose common title parts match, where, as the state of the specific relationship, a common title part is extracted from the title of each connection destination document by removing the portion identical to the character string to which the connection information associating that document is attached, and the extracted common title parts are compared; and
in the document group determination procedure, based on the feature rule, the ratio of the specific connection information to all of the internal connection information is calculated as a dictionary-likeness score, and the dictionary-likeness score is compared with a threshold based on the feature rule to determine whether the document group satisfies the conditions of the dictionary document group.

(Supplementary Note 7) The document group detection method according to Supplementary Notes 4, 5, and 6, wherein, when a plurality of the specific relationships are selected, the document group determination procedure, based on the feature rule in which a predetermined coefficient corresponding to the importance of each specific relationship is defined, weights the dictionary-likeness score calculated for each selected specific relationship by multiplying it by the predetermined coefficient, and calculates a dictionary-likeness score corresponding to the selected specific relationships.

(Supplementary Note 8) The document group detection method according to Supplementary Note 4, 5, or 6, wherein:
the feature counting procedure further generates, for each character string and connection destination document that satisfy the specific relationship, entry information associating the character string with identification information of the connection destination document; and
the outputting procedure outputs the entry information generated for the document group together with the identification information of the document group registered as the target document group candidate.

(Supplementary Note 9) The document group detection method according to Supplementary Note 8, wherein, when a user designates the target document group candidate as a desired document group, the outputting procedure registers the identification information of the designated target document group candidate in document group information, in which desired document groups are registered, stores it in document group storage means, and also stores the entry information corresponding to the target document group candidate in the document group storage means in association with the document group information.

(Supplementary Note 10) A document group detection apparatus for detecting a predetermined document group that is a set of documents provided on a network and managed by one or more computers, the apparatus comprising:
collecting means for, targeting document groups that form a hierarchical structure in which a plurality of subordinate documents exist under a specific document, searching for any of the subordinate documents using a specific keyword, detecting the specific document based on the subordinate document found, and collecting the plurality of subordinate documents under the specific document;
feature counting means for, for each of the collected specific document and the plurality of subordinate documents of the document group, extracting connection information that is attached to an arbitrary character string in the document and indicates an association with a specific other document, and counting the number of states in which the subordinate document and the associated connection destination document are in a specific relationship;
document group determination means for reading a feature rule, which is a condition using the specific relationship, from feature rule storage means in which the feature rule is stored, determining whether the number of states of the specific relationship for the document group satisfies the condition of the feature rule, and registering a document group that satisfies the condition as a target document group candidate; and
output means for outputting the document group registered as the target document group candidate.

(Supplementary Note 11) A document group detection program for detecting a predetermined document group that is a set of documents provided on a network and managed by one or more computers, the program causing a computer to function as:
collecting means for, targeting document groups that form a hierarchical structure in which a plurality of subordinate documents exist under a specific document, searching for any of the subordinate documents using a specific keyword, detecting the specific document based on the subordinate document found, and collecting the plurality of subordinate documents under the specific document;
feature counting means for, for each of the collected specific document and the plurality of subordinate documents of the document group, extracting connection information that is attached to an arbitrary character string in the document and indicates an association with a specific other document, and counting the number of states in which the subordinate document and the associated connection destination document are in a specific relationship;
document group determination means for reading a feature rule, which is a condition using the specific relationship, from feature rule storage means in which the feature rule is stored, determining whether the number of states of the specific relationship for the document group satisfies the condition of the feature rule, and registering a document group that satisfies the condition as a target document group candidate; and
output means for outputting the document group registered as the target document group candidate.

FIG. 1 is a diagram showing an outline of the invention.
FIG. 2 is a diagram showing a configuration example of the dictionary site detection system.
FIG. 3 is a block diagram showing a hardware configuration example of the dictionary site detection server.
FIG. 4 is a diagram for explaining the features of a dictionary site.
FIG. 5 is a diagram showing the flow of processing from the input of a keyword to the acquisition of the page information of a site.
FIG. 6 is a diagram showing an example of the URL-title table.
FIG. 7 is a diagram showing an example of the anchor text-link destination URL table.
FIG. 8 is a diagram showing an example of the aggregation information for feature 1.
FIG. 9 is a diagram showing an example of the aggregation information for feature 2.
FIG. 10 is a diagram showing an example of the aggregation information for feature 3.
FIG. 11 is a diagram showing an example of the dictionary candidate table and its entry candidate table.
FIG. 12 is a diagram showing an example of the dictionary site candidate list screen.
FIG. 13 is a flowchart showing the overall processing procedure of the dictionary site detection method.
FIG. 14 is a flowchart showing the procedure of the site acquisition process.
FIG. 15 is a flowchart showing the procedure of the link information extraction process.
FIG. 16 is a flowchart showing the procedure of the link information A extraction process.
FIG. 17 is a flowchart showing the procedure of the link information B extraction process.
FIG. 18 is a flowchart showing the procedure of the link feature aggregation process.
FIG. 19 is a flowchart showing the procedure of the feature 1 analysis process.
FIG. 20 is a flowchart showing the procedure of the feature 2 analysis process.
FIG. 21 is a flowchart showing the procedure of the feature 3 analysis process.
FIG. 22 is a flowchart showing the procedure of the dictionary site determination process.
FIG. 23 is a flowchart showing the procedure of the dictionary entry candidate creation process.
FIG. 24 is a flowchart showing the procedure of the user presentation process.

Explanation of symbols

10 Document group detection apparatus
11a Document storage means
11b Feature rule storage means
11c Aggregation information storage means
11d Document group candidate storage means
11e Document group storage means
12 Document collecting means
13 Feature counting means
14 Document group determination means
15 Document group presentation means

Claims

A document group detection method for detecting a predetermined document group that is a set of documents provided on a network and managed by one or more computers, wherein a computer executes:
a collection procedure of, for a document group having a hierarchical structure in which a plurality of subordinate documents exist under a specific document, searching for the subordinate documents using a specific keyword, detecting the specific document based on the subordinate documents found, and collecting the plurality of subordinate documents under the specific document;
a feature counting procedure of extracting, for each of the specific document and the plurality of subordinate documents in the collected document group, link information that is attached to an arbitrary character string in the document and indicates an association with a specific other document, and counting the number of cases in which the document and the associated link destination document are in a specific relationship;
a document group determination procedure of reading a feature rule from feature rule storage means that stores the feature rule, which is a condition using the specific relationship, determining whether the number of cases of the specific relationship counted for the document group satisfies the condition of the feature rule, and registering a document group that satisfies the condition as a candidate for the target document group; and
a procedure of outputting the document group registered as a candidate for the target document group.
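The determination procedure of this claim can be illustrated with a minimal sketch. It assumes the feature counting procedure has already produced, for each document group, a mapping from a relationship name to a count. The names `FeatureRule`, `min_count`, `satisfies`, and `register_candidates` are hypothetical, as is the choice to require every rule to hold; the claim only states that the count must satisfy the condition of the feature rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRule:
    """Hypothetical feature rule: a condition on one counted relationship."""
    feature: str     # label of the counted relationship (assumed naming)
    min_count: int   # the group needs at least this many occurrences


def satisfies(counts, rule):
    """Check one rule against the counts collected for a document group."""
    return counts.get(rule.feature, 0) >= rule.min_count


def register_candidates(group_counts, rules):
    """Return the ids of document groups that satisfy every rule.

    group_counts maps a group id (for example the URL of the specific
    document) to a dict of {feature name: count}.
    """
    return [
        group_id
        for group_id, counts in group_counts.items()
        if all(satisfies(counts, rule) for rule in rules)
    ]


if __name__ == "__main__":
    counts = {
        "http://a.example/": {"title_match": 40},
        "http://b.example/": {"title_match": 2},
    }
    print(register_candidates(counts, [FeatureRule("title_match", 10)]))
```

Keeping the rules as data rather than code mirrors the claim's separation between the feature rule storage means and the determination procedure, so a rule can be changed without touching the counting step.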

The document group detection method according to claim 1, wherein
the document group to be detected is a dictionary document group concerning explanatory information on a plurality of terms in an arbitrary field, in which the specific document lists the terms to be explained and the subordinate documents are glossary documents each describing the explanation of a term,
the link information is attached to a character string representing a term appearing in the specific document and the glossary documents, and associates the character string with the glossary document corresponding to that character string, and
the specific relationship in the feature counting procedure is a specific relationship between the character string associated by the link information and the link destination document.

The document group detection method according to claim 2, wherein
in the feature counting procedure, the character string to which the link information is attached and the link destination document corresponding to the character string are analyzed, and the number of pieces of internal link information, by which glossary documents belonging to the same document group are associated, and, as the state of the specific relationship, the number of pieces of specific internal link information, for which the character string to which the link information is attached matches a character string included in the title of the link destination document, are counted, and
in the document group determination procedure, the ratio of the specific internal link information to all of the internal link information is calculated as a dictionary-like score, and the dictionary-like score is compared with a threshold based on the feature rule to determine whether the document group satisfies the conditions of a dictionary document group.
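The dictionary-like score of this claim can be sketched as follows. The `Link` record, the substring test used for "matches a character string included in the title", the `>=` comparison, and the default threshold of 0.5 are all assumptions made for illustration; the claim itself fixes only the ratio of specific internal links to all internal links and its comparison with a threshold.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical record for one piece of link (connection) information."""
    anchor_text: str    # character string the link is attached to
    target_title: str   # title of the link destination document
    is_internal: bool   # True if the destination is in the same document group


def score_by_title(links):
    """Ratio of internal links whose anchor text appears in the target title."""
    internal = [link for link in links if link.is_internal]
    if not internal:
        return 0.0
    specific = [
        link for link in internal
        if link.anchor_text and link.anchor_text in link.target_title
    ]
    return len(specific) / len(internal)


def is_dictionary_candidate(links, threshold=0.5):
    """Compare the score with a threshold taken from the feature rule."""
    return score_by_title(links) >= threshold


if __name__ == "__main__":
    sample = [
        Link("TCP", "TCP - Glossary", True),
        Link("UDP", "UDP - Glossary", True),
        Link("Home", "Example Site", True),
        Link("Sponsor", "Sponsor", False),
    ]
    print(score_by_title(sample), is_dictionary_candidate(sample))
```

External links are excluded from both the numerator and the denominator, so a page that links out heavily is not penalized for it.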

The document group detection method according to claim 2, wherein
in the feature counting procedure, the character string to which the link information is attached and the link destination document corresponding to the character string are analyzed, and the number of pieces of internal link information, by which glossary documents belonging to the same document group are associated, and, as the state of the specific relationship, the number of pieces of specific internal link information, for which the character string to which the link information is attached matches a character string included in the file name of the link destination document, are counted, and
in the document group determination procedure, the ratio of the specific internal link information to all of the internal link information is calculated as a dictionary-like score, and the dictionary-like score is compared with a threshold based on the feature rule to determine whether the document group satisfies the conditions of a dictionary document group.
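The file-name variant of the score can be sketched in the same way. The helper names below are hypothetical, and two details are assumptions not stated in the claim: the file name is compared without its extension, and the comparison ignores case.

```python
import posixpath
from urllib.parse import unquote, urlparse


def file_stem(url):
    """Return the file name in the URL path, without its extension."""
    path = unquote(urlparse(url).path)
    stem, _ext = posixpath.splitext(posixpath.basename(path))
    return stem


def matches_file_name(anchor_text, url):
    """True if the anchor text equals the file stem (case-insensitive)."""
    stem = file_stem(url).casefold()
    return bool(stem) and stem == anchor_text.strip().casefold()


def score_by_file_name(internal_links):
    """internal_links: list of (anchor_text, destination_url) pairs."""
    if not internal_links:
        return 0.0
    hits = sum(
        1 for text, url in internal_links if matches_file_name(text, url)
    )
    return hits / len(internal_links)


if __name__ == "__main__":
    links = [
        ("TCP", "http://example.com/terms/tcp.html"),
        ("DNS", "http://example.com/terms/dns.html"),
        ("Top", "http://example.com/terms/index.html"),
        ("About", "http://example.com/terms/"),
    ]
    print(score_by_file_name(links))
```

A URL that ends in a directory has no file name, so it never counts as a match in this sketch.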

The document group detection method according to claim 2, wherein
in the feature counting procedure, the character string to which the link information is attached and the link destination document corresponding to the character string are analyzed, and the number of pieces of internal link information, by which glossary documents belonging to the same document group are associated, is counted, and, as the state of the specific relationship, a common title part shared by the titles of the link destination documents is extracted, the part of each title remaining after the common title part is excluded is compared with the character string to which the link information associating that link destination document is attached, and the number of pieces of specific link information for which the two match is counted, and
in the document group determination procedure, the ratio of the specific link information to all of the internal link information is calculated as a dictionary-like score, and the dictionary-like score is compared with a threshold based on the feature rule to determine whether the document group satisfies the conditions of a dictionary document group.
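One way to read the common-title comparison in this claim is sketched below. It assumes the common title part is a character-level prefix or suffix shared by all link destination titles (for example a site name appended to every page title); the claim does not say how the common part is found, and the function names are hypothetical.

```python
import os


def common_affixes(titles):
    """Return the (prefix, suffix) shared by all titles, character by character."""
    if len(set(titles)) < 2:
        return "", ""  # nothing to compare against
    prefix = os.path.commonprefix(titles)
    suffix = os.path.commonprefix([t[::-1] for t in titles])[::-1]
    return prefix, suffix


def strip_common(title, prefix, suffix):
    """Remove the shared prefix and suffix, leaving the page-specific part."""
    core = title
    if prefix and core.startswith(prefix):
        core = core[len(prefix):]
    if suffix and core.endswith(suffix):
        core = core[:-len(suffix)]
    return core.strip()


def score_by_common_title(internal_links):
    """internal_links: list of (anchor_text, destination_title) pairs."""
    if not internal_links:
        return 0.0
    prefix, suffix = common_affixes([title for _, title in internal_links])
    hits = sum(
        1 for text, title in internal_links
        if strip_common(title, prefix, suffix) == text.strip()
    )
    return hits / len(internal_links)


if __name__ == "__main__":
    links = [
        ("TCP", "TCP - IT Glossary"),
        ("DNS", "DNS - IT Glossary"),
        ("Markup language", "HTML - IT Glossary"),
    ]
    print(score_by_common_title(links))
```

A character-level comparison can absorb letters that the terms themselves happen to share (titles for "TCP" and "UDP" share the trailing "P"), so a fuller implementation would probably align the common part to a delimiter or word boundary.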

A document group detection apparatus for detecting a predetermined document group that is a set of documents provided on a network and managed by one or more computers, the apparatus comprising:
collection means for, with respect to a document group having a hierarchical structure in which a plurality of subordinate documents exist under a specific document, searching for the subordinate documents using a specific keyword, detecting the specific document based on the subordinate documents found, and collecting the plurality of subordinate documents under the specific document;
feature counting means for extracting, for each of the specific document and the plurality of subordinate documents in the collected document group, link information that is attached to an arbitrary character string in the document and indicates an association with a specific other document, and counting the number of cases in which the document and the associated link destination document are in a specific relationship;
document group determination means for reading a feature rule from feature rule storage means that stores the feature rule, which is a condition using the specific relationship, determining whether the number of cases of the specific relationship counted for the document group satisfies the condition of the feature rule, and registering a document group that satisfies the condition as a candidate for the target document group; and
output means for outputting the document group registered as a candidate for the target document group.