JP6104729B2

JP6104729B2 - Content search system, content search method, and content search program

Info

Publication number: JP6104729B2
Application number: JP2013126942A
Authority: JP
Inventors: 加藤　剛志; 剛志加藤; 圭黒田; 隼赤塚
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2013-06-17
Filing date: 2013-06-17
Publication date: 2017-03-29
Anticipated expiration: 2033-06-17
Also published as: JP2015001899A

Description

The present invention relates to a content search system, a content search method, and a content search program for searching for content on a network.

In a conventional content search system, as described in Patent Document 1 for example, the body text of each piece of content is extracted from content collected in advance, according to an extraction rule for extracting that body text, and a search index is created from the extracted body text. On receiving a search request containing a search keyword, this content search system searches the created search index and displays, as the search result, all or part of the body text extracted from the content collected in advance.

Another known conventional content search system, described in Patent Document 2 for example, extracts the body text used to create the search index from the link relationships between pieces of content rather than by an extraction rule. This content search system identifies a link-destination HTML file from a hyperlink in a link-source HTML file collected in advance, and extracts the body portion of the link-destination HTML file by comparing the text information in that file with the character string surrounding the hyperlink in the link source.

Patent Document 1: JP 2004-220251 A
Patent Document 2: JP 2013-30041 A

However, in the content search system described in Patent Document 1, the body text displayed as the search result is the text extracted, according to the extraction rule, from content collected in advance. Consequently, even if the content has been corrected or changed by the time the search is requested, only the body text from before the correction or change can be displayed as the search result.

Likewise, even when the body text is extracted from the link relationships between pieces of content instead of by an extraction rule, as in the technique described in Patent Document 2, the body text to be extracted is obtained on the basis of a link-source HTML file collected in advance. The body text displayed as the search result is therefore extracted from a link-destination HTML file that was itself collected in advance. As with the technique described in Patent Document 1, the latest body text of the content may not be displayable as the search result.

The present invention has been made in view of the above circumstances, and an object thereof is to provide a content search system, a content search method, and a content search program capable of displaying the latest body text of content as a search result.

A content search system according to one aspect of the present invention comprises a search server device and a search client device. The search server device has: information collection means for acquiring content from a communication network; extraction rule holding means for holding an extraction rule for extracting the body text of the content acquired by the information collection means, in association with storage location information of the content; text extraction means for extracting the body text of the content acquired by the information collection means, based on the extraction rule held by the extraction rule holding means; registration means for registering the body text extracted by the text extraction means in a search index, in association with information about the content that includes the storage location information of the content; and search means for, on receiving from the search client device a search request for searching for content, extracting a specific search index from the search index based on the search request, selecting, based on the information about the content, either (a) a combination of the storage location information of the content corresponding to the specific search index and a specific extraction rule, extracted from the extraction rule holding means, that corresponds to that storage location information, or (b) the body text of the content registered in the specific search index, and returning the selection to the search client device. The search client device has: information acquisition means for transmitting the search request to the search means of the search server device, receiving, as the search result returned by the search means, either the combination of the storage location information and the specific extraction rule or the body text of the content, and acquiring the content from the communication network using the received storage location information; latest text extraction means for extracting, after the content has been acquired by the information acquisition means, the body text of the content using the specific extraction rule received from the search server device; and display means for displaying the body text received by the information acquisition means or the body text extracted by the latest text extraction means.

Alternatively, a content search method according to another aspect of the present invention is a method of searching for content using a search server device and a search client device. The method includes, performed by the search server device: an information collection step of acquiring content from a communication network; an extraction rule holding step of holding an extraction rule for extracting the body text of the content acquired in the information collection step, in association with storage location information of the content; a text extraction step of extracting the body text of the content acquired in the information collection step, based on the extraction rule held in the extraction rule holding step; a registration step of registering the body text extracted in the text extraction step in a search index, in association with information about the content that includes the storage location information of the content; and a search step of, on receiving from the search client device a search request for searching for content, extracting a specific search index from the search index based on the search request, selecting, based on the storage location information of the content, either (a) a combination of the storage location information of the content corresponding to the specific search index and a specific extraction rule, extracted in the extraction rule holding step, that corresponds to that storage location information, or (b) the body text of the content registered in the specific search index, and returning the selection to the search client device. The method further includes, performed by the search client device: an information acquisition step of transmitting the search request to the search server device, receiving, as the search result returned in the search step, either the combination of the storage location information and the specific extraction rule or the body text of the content, and acquiring the content from the communication network using the received storage location information; a latest text extraction step of extracting, after the content has been acquired in the information acquisition step, the body text of the content using the specific extraction rule received from the search server device; and a display step of displaying the body text received in the information acquisition step or the body text extracted in the latest text extraction step.

Alternatively, a content search program according to another aspect of the present invention is a program for searching for content using a search server device and a search client device. The program causes a computer operating as the search server device to function as: information collection means for acquiring content from a communication network; extraction rule holding means for holding an extraction rule for extracting the body text of the content acquired by the information collection means, in association with storage location information of the content; text extraction means for extracting the body text of the content acquired by the information collection means, based on the extraction rule held by the extraction rule holding means; registration means for registering the body text extracted by the text extraction means in a search index, in association with information about the content that includes the storage location information of the content; and search means for, on receiving from the search client device a search request for searching for content, extracting a specific search index from the search index based on the search request, selecting, based on the information about the content, either (a) a combination of the storage location information of the content corresponding to the specific search index and a specific extraction rule, extracted from the extraction rule holding means, that corresponds to that storage location information, or (b) the body text of the content registered in the specific search index, and returning the selection to the search client device. The program causes a computer operating as the search client device to function as: information acquisition means for transmitting the search request to the search means of the search server device, receiving, as the search result returned by the search means, either the combination of the storage location information and the specific extraction rule or the body text of the content, and acquiring the content from the communication network using the received storage location information; latest text extraction means for extracting, after the content has been acquired by the information acquisition means, the body text of the content using the specific extraction rule received from the search server device; and display means for displaying the body text received by the information acquisition means or the body text extracted by the latest text extraction means.

According to any of the above aspects of the invention, the search server device acquires content from the communication network and extracts the body text of the content based on the extraction rule held in association with the storage location information of the content. The extracted body text is registered in the search index in association with information about the content, including its storage location information. On receiving from the search client device a search request for searching for content, the search server device extracts a specific search index from the search index based on the search request and returns the information corresponding to that search index to the search client device. In doing so, the search server device can, based on the information about the content, return either the combination of the storage location information of the content corresponding to the specific search index and the specific extraction rule corresponding to that storage location information, or the body text of the content registered in the specific search index. The search client device transmits the search request to the search server device and receives, as the search result, either the combination of the storage location information and the specific extraction rule or the body text of the content. With the former, it can acquire the content from the communication network using the storage location information and extract the body text from the acquired content using the specific extraction rule received from the search server device. Thus, for example, when the body text received from the search server device is not the latest, the search client device can acquire and display the latest body text as needed. Conversely, when the body text received from the search server device is the latest, the search client device can display the received body text without acquiring it again. As a result, the latest body text of the content can be displayed as the search result.
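The client-side behavior described above can be sketched roughly as follows. This is a minimal illustration only, not the implementation disclosed in the patent: the `SearchResult` type, the `resolve_body` function, and the `fetch` and `extract` callables are hypothetical names introduced here, and the format of an extraction rule is left abstract.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SearchResult:
    """One search result as the client might receive it (hypothetical shape)."""
    url: str                     # storage location information of the content
    body: Optional[str] = None   # set when the server returns the indexed body text
    rule: Optional[str] = None   # set when the server returns URL + extraction rule


def resolve_body(
    result: SearchResult,
    fetch: Callable[[str], str],
    extract: Callable[[str, str], str],
) -> str:
    """Return the body text to display for one search result."""
    if result.body is not None:
        # The server judged its indexed body text usable: display it as received.
        return result.body
    # Otherwise re-acquire the content and extract the latest body text locally.
    page = fetch(result.url)
    return extract(page, result.rule)
```

Passing `fetch` and `extract` in as parameters keeps the sketch independent of any particular HTTP client or rule format.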

Preferably, the information about the content includes registration date/time information indicating when the body text of the content was registered in the search index, and the search means determines, based on the registration date/time information, whether the date and time at which the body text of the content registered in the specific search index was registered is newer than a predetermined date and time. When the registration date and time is not newer than the predetermined date and time, the search means returns to the search client device the combination of the storage location information of the content corresponding to the specific search index and the specific extraction rule, extracted from the extraction rule holding means, that corresponds to that storage location information. When the registration date and time is newer than the predetermined date and time, the search means returns to the search client device the body text of the content registered in the specific search index. With this configuration, when the registration date and time of the body text is not newer than the predetermined date and time, the search server device returns to the search client device, as the search result, the combination of the storage location information of the relevant content and the extraction rule. The search client device can therefore extract and display the latest body text from the communication network by using the storage location information and the specific extraction rule received from the search server device. When the registration date and time is newer than the predetermined date and time, the search server device returns the body text of the relevant content to the search client device as the search result. The search client device can therefore display body text that was extracted more recently than the predetermined date and time. As a result, the latest body text of the content can be displayed as the search result.
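The date-based selection described above might look like the following sketch. The dictionary layout of an index entry, the `rules` mapping from URL to extraction rule, and the function name `select_response` are assumptions made for illustration; the patent does not specify a data format.

```python
from datetime import datetime


def select_response(entry: dict, rules: dict, threshold: datetime) -> dict:
    """Choose what the server returns for one matched index entry.

    entry     -- {"url": ..., "body": ..., "registered_at": datetime} (assumed layout)
    rules     -- mapping from URL to its extraction rule (assumed layout)
    threshold -- the predetermined date and time
    """
    if entry["registered_at"] > threshold:
        # Registered more recently than the threshold: return the indexed body text.
        return {"url": entry["url"], "body": entry["body"]}
    # Not newer than the threshold: return URL + rule so the client re-extracts.
    return {"url": entry["url"], "rule": rules[entry["url"]]}
```

A registration time exactly equal to the threshold falls into the "not newer" branch, matching the wording of the description.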

Preferably, the search means determines whether the storage location information of the content corresponding to the specific search index is storage location information of specific content. When it is, the search means returns to the search client device the combination of the storage location information of the specific content and the specific extraction rule, extracted from the extraction rule holding means, that corresponds to that storage location information. When it is not, the search means returns to the search client device the body text of the content registered in the specific search index. With this configuration, when the storage location information is that of specific content whose contents are updated frequently, for example, the search client device can extract and display the latest body text from the communication network.
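A URL-based variant of the same decision can be sketched as below. Treating "specific content" as any URL that starts with one of a configured list of prefixes is an assumption of this sketch, as are the names `select_by_url` and `volatile_prefixes`; the description only says that the storage location information is checked against specific content.

```python
def select_by_url(entry: dict, rules: dict, volatile_prefixes: list) -> dict:
    """Return URL + rule for frequently updated sites, indexed body text otherwise."""
    url = entry["url"]
    # str.startswith accepts a tuple of candidate prefixes.
    if url.startswith(tuple(volatile_prefixes)):
        return {"url": url, "rule": rules[url]}
    return {"url": url, "body": entry["body"]}
```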

Preferably, the search means returns the specific extraction rule as link information, and the information acquisition means uses the link information returned by the search means to acquire the corresponding extraction rule from the communication network and caches the acquired extraction rule. With this configuration, the search client device can hold and use the extraction rule corresponding to the link information received from the search server device. The latest content can then be acquired using the extraction rule cached on the search client device, without receiving the extraction rule itself from the search server device. As a result, the content search process becomes more efficient.

Preferably, the information acquisition means determines whether the date and time at which the specific extraction rule was last cached is recent according to a predetermined criterion and, when it is not, uses the link information returned by the search means to acquire the corresponding extraction rule from the communication network and caches the acquired extraction rule again. With this configuration, when the date and time at which the specific extraction rule returned from the search server device was last cached is not recent according to the predetermined criterion, the search client device can use the link information received from the search server device to acquire the corresponding extraction rule from the communication network and update the extraction rule cached in the search client device. The search client device can thus acquire content using the latest extraction rule. As a result, content can be extracted appropriately.
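The caching and refresh behavior described above could be sketched as follows. The class name `RuleCache`, the use of a maximum age in seconds as the "predetermined criterion", and the injected `fetch_rule` and `clock` callables are all assumptions of this sketch rather than details taken from the patent.

```python
import time
from typing import Callable, Dict, Tuple


class RuleCache:
    """Client-side cache of extraction rules keyed by their link information."""

    def __init__(
        self,
        fetch_rule: Callable[[str], str],
        max_age_seconds: float,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch_rule = fetch_rule    # downloads a rule from its link
        self._max_age = max_age_seconds  # assumed freshness criterion
        self._clock = clock              # injectable for testing
        self._entries: Dict[str, Tuple[str, float]] = {}  # link -> (rule, cached_at)

    def get(self, rule_link: str) -> str:
        """Return the cached rule, re-fetching it when missing or too old."""
        now = self._clock()
        cached = self._entries.get(rule_link)
        if cached is not None and now - cached[1] < self._max_age:
            return cached[0]
        rule = self._fetch_rule(rule_link)
        self._entries[rule_link] = (rule, now)
        return rule
```

Injecting the clock makes the staleness check deterministic in tests; a real client would simply rely on the default `time.time`.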

Preferably, the information acquisition means determines whether the storage location information of the content corresponding to the specific search index is storage location information of specific content and, when it is, uses the link information returned by the search means to acquire the corresponding extraction rule from the communication network and caches the acquired extraction rule again. With this configuration, when the storage location information indicated by the specific search index is that of specific content, the search client device can update the extraction rule cached in the search client device. Thus, when the storage location information is that of specific content whose contents are updated frequently, for example, the search client device can extract and display items of the content using the latest extraction rule acquired from the communication network. As a result, content can be extracted appropriately as its contents are updated.

According to the present invention, it is possible to provide a content search system, a content search method, and a content search program capable of displaying the latest body text of content as a search result.

FIG. 1 is a block diagram showing the functional configuration of a Web content search system according to one embodiment.
FIG. 2 is a diagram showing the hardware configuration of the search server device and the search client device shown in FIG. 1.
FIG. 3 is a sequence diagram showing the basic search processing performed by the Web content search system shown in FIG. 1.
FIG. 4 is a flowchart explaining the procedure by which the search server device shown in FIG. 1 makes a search response based on the search index registration date/time information of the body text of Web content.
FIG. 5 is a diagram showing the search list created by the search server device in the procedure shown in FIG. 4.
FIG. 6 is a flowchart explaining the procedure by which the search server device shown in FIG. 1 makes a search response based on the URL of Web content.
FIG. 7 is a diagram showing the search list created by the search server device in the procedure shown in FIG. 6.
FIG. 8 is a flowchart explaining the procedure by which the search server device shown in FIG. 1 makes a search response based on the complexity of an extraction rule.
FIG. 9 is a diagram showing the search list created by the search server device in the procedure shown in FIG. 8.
FIG. 10 is a sequence diagram showing the search processing when the search client device shown in FIG. 1 caches extraction rules.
FIG. 11 is a flowchart explaining the procedure by which the search client device shown in FIG. 1 updates the extraction rule cache based on the date and time at which an extraction rule was cached.
FIG. 12 is a diagram showing the search result list received from the search server device in the procedure shown in FIG. 11.
FIG. 13 is a flowchart explaining the procedure by which the search client device shown in FIG. 1 updates the extraction rule cache based on the URL of Web content.
FIG. 14 is a diagram showing the search result list received from the search server device in the procedure shown in FIG. 13.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the description of the drawings, the same elements are denoted by the same reference numerals, and redundant description is omitted.

First, the functional configuration of a content search system according to one embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the functional configuration of a Web content search system according to one embodiment. As shown in FIG. 1, the Web content search system 1 includes a search server device 10 and a search client device 20. The search server device 10 and the search client device 20 are connected to each other via a communication network 30. The communication network 30 includes servers that hold a plurality of pieces of Web content.

The search server device 10 has, as functional components, an information collection unit 11, an extraction rule holding unit 12, a text extraction unit 13, a search index registration unit 14, and a search unit 15. The search client device 20 has, as functional components, an information acquisition unit 21, a latest text extraction unit 22, and a display unit 23. Each component is described in detail below.

The information collection unit 11 is information collection means for acquiring Web content from the communication network 30. The information collection unit 11 acquires, as needed, Web content including document data, multimedia data, and the like that is provided or distributed on the communication network 30, such as the Internet or an intranet, from servers functioning as Web sites.

The extraction rule holding unit 12 is extraction rule holding means for holding an extraction rule for extracting the body text of the Web content acquired by the information collection unit 11, in association with storage location information of the Web content. The body text of Web content is the portion containing the information that the Web site primarily intends to provide or distribute, that is, the portion remaining after unnecessary portions such as advertisements and menus are excluded. In the present invention, the body text of Web content includes not only the body text itself but also the title, image URLs, and the like. Conventionally known extraction rules can be applied for extracting the body text of Web content; for example, the extraction rule described in Patent Document 1 (JP 2004-220251 A) can be used. The storage location information of Web content is, for example, a URL (Uniform Resource Locator) indicating where the Web content is stored. The extraction rule is held in association with the URL. The extraction rule holding unit 12 may also hold the extraction rule as link information. The extraction rule holding unit 12 holds, for example, a database in which the URL of the Web content, the extraction rule, and update date/time information indicating when the extraction rule was updated are associated with one another.

The text extraction unit 13 is text extraction means that extracts the body text of the Web content acquired by the information collection unit 11, based on the extraction rules held by the extraction rule holding unit 12. The text extraction unit 13 reads an extraction rule from the extraction rule holding unit 12 and applies that rule to the Web content, thereby extracting the body text of the Web content.
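The relationship between the extraction rule holding unit 12 and the text extraction unit 13 can be illustrated with a minimal sketch. The specification does not define the format of an extraction rule, so this sketch assumes, purely for illustration, that a rule is a regular expression whose first capture group is the body text. The names `ExtractionRule`, `ExtractionRuleStore`, and `extract_body`, the sample page, and the update date are all hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ExtractionRule:
    pattern: str          # assumption: a regex whose first group is the body text
    updated_at: datetime  # update date and time of the rule


class ExtractionRuleStore:
    """Holds extraction rules keyed by the URL of the Web content."""

    def __init__(self):
        self._rules = {}

    def put(self, url, pattern, updated_at):
        self._rules[url] = ExtractionRule(pattern, updated_at)

    def get(self, url):
        # Returns None when no rule is held for the URL.
        return self._rules.get(url)


def extract_body(html, rule):
    """Apply a rule to a page and return the body text, or None if no match."""
    match = re.search(rule.pattern, html, flags=re.DOTALL)
    return match.group(1).strip() if match else None


store = ExtractionRuleStore()
store.put(
    "http://xxx.com/xxx.html",
    r'<div id="main">(.*?)</div>',
    datetime(2013, 5, 1, tzinfo=timezone.utc),  # hypothetical update time
)

page = '<div id="menu">Home</div><div id="main"> Body text </div><div id="ad">Buy</div>'
body = extract_body(page, store.get("http://xxx.com/xxx.html"))
print(body)
```

Keying the store by URL mirrors the statement that each rule is held in association with a URL; a real rule format (for example, the one in Patent Document 1) would replace the regex assumption.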

The search index registration unit 14 is registration means that registers the body text of the Web content extracted by the text extraction unit 13 in a search index, in association with information about the content that includes the storage location information of the Web content. Examples of this information include the URL of the Web content and registration date and time information, which is the date and time at which the body text of the Web content was registered in the search index. The search index registration unit 14 has, for example, a database in which the search index, the URL of the Web content, the body text of the Web content on which the search index is based, and the registration date and time information are associated with one another.

The search unit 15 is search means that returns search results in response to a search request from the search client device 20. On receiving a search request for Web content from the search client device 20, the search unit 15 extracts a specific search index from among the search indexes based on the search request. The search unit 15 then selects, based on the information about the Web content, either (a) the combination of the URL of the Web content associated with the specific search index and the specific extraction rule determined using that URL, or (b) the body text of the Web content registered in the specific search index, and returns the selected item to the search client device 20. The specific extraction rule returned to the search client device 20 is extracted from the extraction rules held in the extraction rule holding unit 12, based on the URL of the Web content associated with the specific search index.

More specifically, the search unit 15 lists the information about the Web content corresponding to the extracted specific search index and creates a search result list. For example, the search unit 15 lists, as the search result, the search index of the matching Web content and the information about the Web content linked to it. The items of information about the Web content included in the search result list are, for example, the URL of the Web content, the date and time at which the body text of the Web content was registered (registration date and time information), the extraction rule for the Web content, and the body text of the Web content, which comprises the title, the body, and the image URL. Of these items, the URL and the registration date and time information are the URL and the registration date and time recorded when the search index registration unit 14 registered the body text in the search index, and they are set in the search result list in association with the search index. The other items are set selectively as needed, based on the information about the Web content. Specifically, using the registration date and time information or the URL of the Web content as a condition, either the extraction rule or the body text of the Web content is selected and set in the search result list.

The search unit 15 refers to the search result list set as described above, extracts information from it, and returns that information to the search client device 20. For example, based on the registration date and time information of the Web content, the search unit 15 determines whether the date and time at which the search index registration unit 14 registered the body text of the Web content in the specific search index is new relative to a predetermined date and time. Whether it is new is defined, for example, by whether the registration took place within one day of the current date and time. If the search unit 15 determines that the registration date and time is not newer than the predetermined date and time, it returns to the search client device 20 the combination of the URL of the Web content corresponding to the specific search index and the extraction rule that is linked to that URL and held in the extraction rule holding unit 12. If, on the other hand, it determines that the registration date and time is newer than the predetermined date and time, it returns to the search client device 20 the body text of the Web content registered in the specific search index.

The search unit 15 also determines whether the URL of the Web content corresponding to the specific search index is a specific URL designated in advance. If the URL is such a specific URL, the search unit 15 returns to the search client device 20 the combination of that URL and the extraction rule that is linked to the URL and held in the extraction rule holding unit 12. If the URL is not a specific URL, the search unit 15 returns to the search client device 20 the body text of the Web content registered in the specific search index.

The search unit 15 also determines whether the extraction rule for extracting the body text of the Web content corresponding to the specific search index is simpler than a predetermined extraction rule. If the extraction rule is simpler than the predetermined extraction rule, the search unit 15 returns to the search client device 20 the combination of the URL of the Web content corresponding to the specific search index and the extraction rule that is linked to that URL and held in the extraction rule holding unit 12. If the extraction rule is more complex than the predetermined extraction rule, the search unit 15 returns to the search client device 20 the body text of the Web content registered in the specific search index.
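The three selection criteria described for the search unit 15 (registration date and time, pre-designated URL, and rule complexity) can be sketched as three independent predicates. This is an illustration under stated assumptions, not the patented implementation: the function names and the return labels `"body"` and `"url_and_rule"` are hypothetical, the specification does not say how rule complexity is measured (string length is used here only as a stand-in), and the handling of a rule exactly as complex as the reference rule is an arbitrary choice.

```python
from datetime import datetime, timedelta, timezone


def choose_by_date(registered_at, now, max_age=timedelta(days=1)):
    """Registered within max_age of now: return the indexed body text.
    Otherwise return the URL together with its extraction rule."""
    return "body" if now - registered_at <= max_age else "url_and_rule"


def choose_by_url(url, designated_urls):
    """A URL designated in advance is answered with the URL and its rule."""
    return "url_and_rule" if url in designated_urls else "body"


def choose_by_complexity(rule, reference_rule, complexity=len):
    """A rule simpler than the reference rule is sent to the client.
    Assumption: complexity is approximated by len(); ties are treated
    as 'not simpler', which the specification does not address."""
    if complexity(rule) < complexity(reference_rule):
        return "url_and_rule"
    return "body"


JST = timezone(timedelta(hours=9))
now = datetime(2013, 5, 10, 12, 0, tzinfo=JST)  # hypothetical current time
old = datetime(2013, 5, 2, 11, 11, 11, tzinfo=JST)
print(choose_by_date(old, now))
print(choose_by_url("http://xxx.com/xxx.html", {"http://xxx.com/xxx.html"}))
print(choose_by_complexity("short", "a much longer reference rule"))
```

Sending a simple rule to the client keeps client-side extraction cheap, while a complex rule is applied only on the server and the already-extracted body text is returned instead.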

The search unit 15 also creates the search result list with the link information of the extraction rule, acquired from the database of the extraction rule holding unit 12, included in the information about the Web content, and transmits the search result list to the search client device 20. The link information of an extraction rule indicates where the extraction rule is stored, for example a URL indicating that storage location.

The information acquisition unit 21 of the search client device 20 is information acquisition means that transmits a search request to the search unit 15, receives as the search result either the combination of the URL of the Web content and the specific extraction rule or the body text of the Web content returned by the search unit 15, and acquires the Web content from the communication network 30 using the received URL.

The information acquisition unit 21 also uses the link information of the extraction rule returned from the search server device 10 to acquire the corresponding extraction rule from the communication network 30. In this case, the information acquisition unit 21 caches (temporarily stores) the acquired extraction rule in the search client device 20. The information acquisition unit 21 also caches the date and time of caching in association with the cached extraction rule.

The information acquisition unit 21 also determines whether the date and time at which the specific extraction rule corresponding to the search index was last cached is new according to a predetermined criterion. The date and time at which the extraction rule was last cached is the most recent date and time at which the information acquisition unit 21 cached the extraction rule in the search client device 20, and it is identified by reading the cache data in the search client device 20. The predetermined criterion is, for example, the rule acquisition date and time at which the extraction rule was acquired (updated) on the search server device 10 side. In other words, the rule acquisition date and time corresponds to the update date and time information held in the database of the extraction rule holding unit 12. This rule acquisition date and time is set in the search result list by the search unit 15 and is acquired by the information acquisition unit 21 as part of the search result. Whether the cached rule is new is determined, for example, by whether the date and time at which the extraction rule was last cached is newer than the rule acquisition date and time set in the search result list. If the information acquisition unit 21 determines that the date and time at which the specific extraction rule was last cached is not newer than the rule acquisition date and time set in the search result list, it uses the link information of the extraction rule returned from the search unit 15 to acquire the corresponding extraction rule from the communication network 30. It then caches the acquired extraction rule again, thereby updating the extraction rule.
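The cache check performed by the information acquisition unit 21 can be sketched as follows. The class `RuleCache`, the injected `fetch_rule` and `clock` callables, the rule link `http://rules.example/xxx`, and all dates are hypothetical and exist only to make the example self-contained; no real network access is performed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class CachedRule:
    rule: str
    cached_at: datetime  # date and time at which the rule was last cached


class RuleCache:
    """Client-side cache of extraction rules, keyed by rule link information."""

    def __init__(self, fetch_rule, clock):
        self._fetch_rule = fetch_rule  # callable: rule link -> rule text
        self._clock = clock            # callable returning the current time
        self._entries = {}

    def get(self, rule_link, rule_updated_at):
        entry = self._entries.get(rule_link)
        # Re-acquire when nothing is cached, or when the cached copy is
        # not newer than the server-side rule acquisition date and time.
        if entry is None or entry.cached_at <= rule_updated_at:
            entry = CachedRule(self._fetch_rule(rule_link), self._clock())
            self._entries[rule_link] = entry
        return entry.rule


utc = timezone.utc
calls = []


def fake_fetch(link):
    # Stand-in for acquiring the rule over the communication network.
    calls.append(link)
    return f"rule-v{len(calls)}"


times = iter([datetime(2013, 5, 3, tzinfo=utc), datetime(2013, 5, 9, tzinfo=utc)])
cache = RuleCache(fake_fetch, lambda: next(times))
link = "http://rules.example/xxx"

first = cache.get(link, datetime(2013, 5, 1, tzinfo=utc))   # nothing cached: fetch
second = cache.get(link, datetime(2013, 5, 1, tzinfo=utc))  # cached copy is newer: reuse
third = cache.get(link, datetime(2013, 5, 5, tzinfo=utc))   # rule updated since: fetch again
print(first, second, third, len(calls))
```

Comparing against the server-side update time, rather than a fixed time-to-live, means the client re-acquires a rule only when the server reports that the rule has actually changed.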

The information acquisition unit 21 also determines whether the URL of the Web content returned from the search server device 10 in the search result list is a specific URL designated in advance. If it determines that the URL is such a specific URL, the information acquisition unit 21 uses the link information of the extraction rule returned by the search unit 15 to acquire the corresponding extraction rule from the communication network 30. It then caches the acquired extraction rule again, thereby updating the extraction rule.

The latest text extraction unit 22 is latest text extraction means that, after the information acquisition unit 21 has acquired the Web content, extracts the body text of the Web content using the specific extraction rule received from the search server device 10.

Alternatively, after the information acquisition unit 21 has acquired the Web content, the latest text extraction unit 22 extracts the body text of the Web content using the extraction rule that is cached in the search client device 20 in association with that Web content.

The display unit 23 is display means that displays either the body text of the Web content received by the information acquisition unit 21 or the body text of the Web content extracted by the latest text extraction unit 22. That is, the display unit 23 selects what to display according to what was returned from the search server device 10.

FIG. 2 is a diagram showing the hardware configuration of the search server device 10 and the search client device 20 shown in FIG. 1. As shown in FIG. 2, the search server device 10 physically comprises a CPU 101, a ROM 102, a RAM 103, an input device 104, an output device 105, a communication module 106 serving as a data transmission and reception device, an auxiliary storage device 107, and the like. Each function of the search server device 10 is realized by loading predetermined software onto hardware such as the CPU 101 and the RAM 103, thereby operating the communication module 106, the input device 104, and the output device 105 under the control of the CPU 101, and by reading and writing data in the RAM 103. Similarly, the search client device 20 physically comprises a CPU 201, a ROM 202, a RAM 203, an input device 204, an output device 205, a communication module 206 serving as a data transmission and reception device, an auxiliary storage device 207, and the like. Each function of the search client device 20 is realized by loading predetermined software onto hardware such as the CPU 201 and the RAM 203, thereby operating the communication module 206, the input device 204, and the output device 205 under the control of the CPU 201, and by reading and writing data in the RAM 203.

Next, the basic Web content search method performed by the Web content search system 1 will be described with reference to the sequence diagram shown in FIG. 3.

FIG. 3 is a sequence diagram showing the basic search processing performed by the Web content search system 1 shown in FIG. 1. First, the search server device 10 acquires Web content as needed (information collection step: S1). Meanwhile, the search server device 10 holds and updates in advance the extraction rules for extracting the body text of Web content, in association with the URLs of the Web content (extraction rule holding step: S2). Each time Web content is acquired in the information collection step, the search server device 10 extracts the body text of that Web content based on the extraction rule held in the extraction rule holding step (text extraction step: S3), and registers the body text in the search index in association with information about the Web content that includes its URL (registration step: S4).

When the search client device 20 receives a search request from the user (S5), it transmits the request to the search server device 10 (information acquisition step: S6). The search request contains information for searching, such as keywords. The search server device 10 searches for Web content in response to the search request and extracts a specific search index (S7). Then, based on the information about the Web content corresponding to the specific search index, it returns to the search client device 20, as the search response, either the combination of the URL of the matching Web content and the text extraction rule linked to it, or the body text of the matching Web content (search step: S8).

When returning a search response to the search client device 20, the search server device 10 selects the information to return based on, for example, the registration date and time information of the Web content, the URL of the Web content, or the extraction rule for the Web content. The more specific search response procedure that the search server device 10 follows based on this information about the Web content is described later.

On receiving the search response from the search server device 10, the search client device 20 acquires the Web content corresponding to the URL included in the search response from one of the Web servers 40, a plurality of which exist on the communication network 30 (information acquisition step: S9). The search client device 20 then extracts the body text from the acquired Web content using the text extraction rule included in the search response (latest text extraction step: S10). Finally, the search client device 20 displays, as the search result, either the extracted body text or the body text of the Web content included in the search response (display step: S11).

Next, the specific search response procedure that the search server device 10 follows based on the information about the Web content will be described with reference to the flowcharts and search result lists shown in FIGS. 4 to 8. First, the search response that the search server device 10 performs based on the registration date and time information of the body text of the Web content will be described. FIG. 4 is a flowchart illustrating the procedure by which the search server device 10 shown in FIG. 1 makes a search response based on the date and time at which the body text of the Web content was registered in the search index, and FIG. 5 is a diagram showing the search result list that the search server device 10 creates in the procedure shown in FIG. 4.

As shown in FIG. 4, on receiving a search request from the search client device 20, the search unit 15 of the search server device 10 searches for the matching Web content and lists the items shown in the search result list 16 of FIG. 5 (S12). Each listed item is information about the Web content linked to an ID corresponding to a specific search index. For example, the items of the search result list 16 include the ID corresponding to the search index, the URL of the Web content, the registration date and time information of the Web content, the extraction rule, the title of the Web content, the body of the Web content, and the image URL of the Web content. Of these items, the URL and the registration date and time information were set in association with the search index when the search index registration unit 14 registered the body text of the Web content in the search index. For example, for ID "1" in the search result list 16, "http://xxx.com/xxx.html" is set as the URL of the Web content and "2013-05-02T11:11:11+0900" is set as the registration date and time information.

Next, the search unit 15 checks the registration date and time information of the Web content included in the search result list 16 and determines whether it is new relative to, for example, the current date and time (May 10, 2013) (S13). If the search unit 15 determines that the body text of the Web content was registered within one day of the current date and time and is therefore new (S13; Yes), it sets the body text of the Web content registered in the search index by the search index registration unit 14 in the search result list 16 (S14). If, on the other hand, it determines that the body text was registered one day or more before the current date and time and is therefore not new (S13; No), it sets the extraction rule held in the extraction rule holding unit 12 in the search result list 16 (S15). For example, in the rows with IDs "2" and "3" of the search result list 16, the registration date and time was determined to be new relative to the current date and time, so the title, body, and image URL of the Web content are set. In the rows with IDs "1" and "4", the registration date and time was determined not to be new, so the extraction rule is set.

Next, the search unit 15 determines whether the search result list 16 contains any unprocessed Web content (S16). If it determines that unprocessed Web content remains (S16; Yes), it again checks the date and time at which the body text of the unprocessed Web content was registered and determines whether it is new relative to the current date and time (S13). If it determines that no unprocessed Web content remains (S16; No), it transmits the search result list 16 to the search client device 20 (S17).
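Steps S12 to S17 amount to a loop over the search hits that fills in either the body fields or the extraction rule for each entry. The following sketch illustrates that loop under stated assumptions: the dictionary layout, the function name `build_result_list`, the rule strings, the current time of day, and the entire second hit (ID 2) are hypothetical. Only the URL and registration timestamp of ID 1 and the current date come from the description of FIG. 5.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))


def build_result_list(hits, rules, now, max_age=timedelta(days=1)):
    """Build the search result list from index hits (S12 to S16)."""
    results = []
    for hit in hits:  # repeat while unprocessed content remains (S16)
        entry = {
            "id": hit["id"],
            "url": hit["url"],
            "registered_at": hit["registered_at"],
        }
        if now - hit["registered_at"] <= max_age:  # S13: registered recently?
            # S14: new enough, so return the indexed title, body, and image URL.
            entry["title"] = hit["title"]
            entry["body"] = hit["body"]
            entry["image_url"] = hit["image_url"]
        else:
            # S15: not new, so return the extraction rule for client-side use.
            entry["rule"] = rules[hit["url"]]
        results.append(entry)
    return results  # S17 would transmit this list to the client


hits = [
    {"id": 1, "url": "http://xxx.com/xxx.html",
     "registered_at": datetime(2013, 5, 2, 11, 11, 11, tzinfo=JST),
     "title": "Old title", "body": "Old body", "image_url": None},
    {"id": 2, "url": "http://yyy.example/yyy.html",  # hypothetical hit
     "registered_at": datetime(2013, 5, 10, 9, 0, 0, tzinfo=JST),
     "title": "Fresh title", "body": "Fresh body", "image_url": None},
]
rules = {"http://xxx.com/xxx.html": "rule-1", "http://yyy.example/yyy.html": "rule-2"}
now = datetime(2013, 5, 10, 12, 0, tzinfo=JST)  # hypothetical time on May 10, 2013

results = build_result_list(hits, rules, now)
for r in results:
    print(r["id"], "body" if "body" in r else "rule")
```

Leaving the body fields unset for stale entries is what lets the client distinguish the two response types simply by checking which field is present.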

In the search client device 20, the information acquisition unit 21 refers to the search result list 16 and, if the body item is set for the ID corresponding to a specific search index, reads the body text as set. In this case, the display unit 23 displays the body text from the search result list 16 read by the information acquisition unit 21. If, on the other hand, the body item is not set for the ID and the extraction rule item is set instead, the information acquisition unit 21 reads the combination of the URL and the extraction rule included in the search result list 16. The information acquisition unit 21 then acquires the corresponding Web content from the communication network 30 using that URL. The latest text extraction unit 22 applies the extraction rule read by the information acquisition unit 21 to the Web content acquired by the information acquisition unit 21, thereby extracting the latest body text of the Web content. In this case, the display unit 23 displays the latest body text extracted by the latest text extraction unit 22.
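The client-side handling of a search result list can be sketched as a single function that either passes through the body text set by the server or fetches the page and applies the returned rule. As before, this is only an illustration: the regex-based rule format, the function names, the in-memory `fake_web` dictionary standing in for the Web servers 40, and the sample entries are assumptions, not part of the specification.

```python
import re


def apply_rule(page, rule):
    """Assumption: a rule is a regex whose first group is the body text."""
    match = re.search(rule, page, flags=re.DOTALL)
    return match.group(1).strip() if match else ""


def resolve_entry(entry, fetch_page):
    """Return the text to display for one search result entry."""
    if entry.get("body") is not None:
        # The server returned the indexed body text: display it as is.
        return entry["body"]
    # Otherwise acquire the current page and extract its latest body text.
    page = fetch_page(entry["url"])
    return apply_rule(page, entry["rule"])


# In-memory stand-in for the Web servers; no network access is performed.
fake_web = {
    "http://xxx.com/xxx.html": "<nav>Menu</nav><article>Latest text</article>",
}

result_list = [
    {"id": 1, "url": "http://xxx.com/xxx.html", "rule": r"<article>(.*?)</article>"},
    {"id": 2, "url": "http://yyy.example/yyy.html", "body": "Indexed text"},
]

shown = [resolve_entry(e, fake_web.__getitem__) for e in result_list]
print(shown)
```

Passing `fetch_page` in as a parameter keeps the sketch testable without a network; an actual client would supply an HTTP fetch there.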

Next, the search response that the search server device 10 performs based on the URL of the Web content will be described. FIG. 6 is a flowchart illustrating the procedure by which the search server device 10 shown in FIG. 1 makes a search response based on the URL of the Web content, and FIG. 7 is a diagram showing the search result list that the search server device 10 creates in the procedure shown in FIG. 6.

As shown in FIG. 6, on receiving a search request from the search client device 20, the search unit 15 of the search server device 10 searches for the matching Web content and lists the items shown in the search result list 17 of FIG. 7 (S22). Each listed item is information about the Web content linked to an ID corresponding to a specific search index. For example, the items of the search result list 17 include the ID corresponding to the search index, the URL of the Web content, the extraction rule, the title of the Web content, the body of the Web content, and the image URL of the Web content. For example, for ID "1" in the search result list 17, "http://xxx.com/xxx.html" is set as the URL of the Web content.

Next, the search unit 15 checks the URL of the Web content included in the search result list 17 and determines whether it is a specific URL designated in advance (S23). If the search unit 15 determines that the URL is not a specific URL designated in advance (S23; No), it sets the body text of the Web content registered in the search index by the search index registration unit 14 in the search result list 17 (S24). If, on the other hand, it determines that the URL is a specific URL designated in advance (S23; Yes), it sets the extraction rule held in the extraction rule holding unit 12 in the search result list 17 (S25). For example, in the rows with IDs "2" and "3" of the search result list 17, the URL of the Web content was determined not to be a specific URL, so the title, body, and image URL of the Web content are set. In the rows with IDs "1" and "4", the URL was determined to be a specific URL, so the extraction rule is set.

Next, the search unit 15 determines whether the search result list 17 contains any unprocessed Web content (S26). If it determines that unprocessed Web content remains (S26; Yes), it again checks the URL of the unprocessed Web content and determines whether it is a specific URL designated in advance (S23). If it determines that no unprocessed Web content remains (S26; No), it transmits the search result list 17 to the search client device 20 (S27).

In the search client device 20, the information acquisition unit 21 refers to the search result list 17 and, when the body text field is set for the ID corresponding to a specific search index, reads that body text as is. In this case, the display unit 23 displays the body text read from the search result list 17 by the information acquisition unit 21. When, on the other hand, the body text field is not set for that ID and the extraction rule field is set, the information acquisition unit 21 reads the combination of the URL and the extraction rule contained in the search result list 17, and uses the URL to acquire the corresponding Web content from the communication network 30. The latest text extraction unit 22 extracts the body text of the latest Web content by applying the extraction rule read by the information acquisition unit 21 to the Web content acquired by the information acquisition unit 21. In this case, the display unit 23 displays the body text of the latest Web content extracted by the latest text extraction unit 22.
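The client-side branch described above can be sketched as follows. The function name text_to_display and the two callables fetch and apply_rule are hypothetical stand-ins for the information acquisition unit 21 and the latest text extraction unit 22; the format of an extraction rule is not specified here, so it is passed through as an opaque string.

```python
from typing import Callable, Optional


def text_to_display(url: str,
                    body: Optional[str],
                    rule: Optional[str],
                    fetch: Callable[[str], str],
                    apply_rule: Callable[[str, str], str]) -> str:
    """Return the body text the display unit would show for one entry."""
    if body is not None:
        # Body field is set: show it exactly as received from the server.
        return body
    if rule is None:
        raise ValueError("entry has neither a body nor an extraction rule")
    # Body field is not set: fetch the current page and extract locally.
    content = fetch(url)
    return apply_rule(content, rule)
```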

Next, the search response that the search server device 10 makes based on the complexity of the extraction rule will be described. FIG. 8 is a flowchart explaining the processing procedure by which the search server device 10 shown in FIG. 1 makes a search response based on the complexity of the extraction rule, and FIG. 9 is a diagram showing the search result list that the search server device 10 creates in the processing procedure shown in FIG. 8.

As shown in FIG. 8, upon receiving a search request from the search client device 20, the search unit 15 of the search server device 10 searches for the matching Web content and lists the items shown in the search result list 18 of FIG. 9 (S32). Each listed item is information about the Web content associated with the ID corresponding to a specific search index. For example, each entry of the search result list 18 contains the ID corresponding to the search index, the URL of the Web content, the extraction rule, the title of the Web content, the body text of the Web content, and the image URL of the Web content. For example, for ID "1" of the search result list 18, "http://xxx.com/xxx.html" is set as the associated URL of the Web content.

Next, the search unit 15 checks the extraction rule for the body text of each piece of Web content in the search result list 18 and determines whether that extraction rule is simpler than a predetermined threshold (S33). To check the rule, the search unit 15 uses the URL contained in the search result list 18 to look up the extraction rule that is held in the extraction rule holding unit 12 in association with that URL. Whether the extraction rule is simpler than the predetermined threshold is determined, for example, by quantifying the processing scale of executing the extraction rule and comparing the resulting value with the threshold. Specific examples of values that determine the processing scale include the size of the Web content to be extracted from and the number of lines of the extraction rule (its program size). When the value is larger than the threshold, the extraction rule is more complex than the threshold; when the value is smaller than the threshold, the extraction rule is simpler than the threshold. If the search unit 15 determines that the extraction rule is not simpler than the threshold, that is, that it is complex (S33; No), it sets in the search result list 18 the body text of the Web content that the search index registration unit 14 registered in the search index (S34). If, on the other hand, it determines that the extraction rule is simpler than the threshold (S33; Yes), it sets in the search result list 18 the extraction rule held in the extraction rule holding unit 12 (S35). For example, in the rows of the search result list 18 with IDs "2" and "3", the extraction rule was determined to be more complex than the threshold, so the title, body text, and image URL of the Web content are set. In the rows with IDs "1" and "4", the extraction rule was determined to be simpler than the threshold, so the extraction rule is set.

Next, the search unit 15 determines whether the search result list 18 still contains unprocessed Web content (S36). If it does (S36; Yes), the search unit 15 again checks the extraction rule for the body text of the unprocessed Web content and determines whether it is more complex than the predetermined threshold (S33). If it does not (S36; No), the search unit 15 transmits the search result list 18 to the search client device 20 (S37).
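The threshold comparison of step S33 can be sketched as follows. How the processing scale is actually quantified is left open above, so the scoring used here (one point per non-blank rule line plus one point per 1024 bytes of content) is purely an assumption, as are the function names. A value equal to the threshold is treated as "not simpler", which is also an assumption, since only the larger and smaller cases are stated.

```python
def processing_scale(rule_text: str, content_size_bytes: int,
                     bytes_per_point: int = 1024) -> int:
    """Hypothetical score: non-blank rule lines plus one point per 1024 bytes."""
    rule_lines = sum(1 for line in rule_text.splitlines() if line.strip())
    return rule_lines + content_size_bytes // bytes_per_point


def return_rule_instead_of_body(rule_text: str, content_size_bytes: int,
                                threshold: int) -> bool:
    """True corresponds to S33 Yes (simple rule, S35); False to S33 No (S34)."""
    return processing_scale(rule_text, content_size_bytes) < threshold
```

The intent of the branch is that only inexpensive rules are handed to the client for local execution, while expensive ones are replaced by the body text already held in the index.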

In the search client device 20, the information acquisition unit 21 refers to the search result list 18 and, when the body text field is set for the ID corresponding to a specific search index, reads that body text as is. In this case, the display unit 23 displays the body text read from the search result list 18 by the information acquisition unit 21. When, on the other hand, the body text field is not set for that ID and the extraction rule field is set, the information acquisition unit 21 reads the combination of the URL and the extraction rule contained in the search result list 18, and uses the URL to acquire the corresponding Web content from the communication network 30. The latest text extraction unit 22 extracts the body text of the latest Web content by applying the extraction rule read by the information acquisition unit 21 to the Web content acquired by the information acquisition unit 21. In this case, the display unit 23 displays the body text of the latest Web content extracted by the latest text extraction unit 22.

Next, the search processing performed when the search client device 20 caches the extraction rules for the body text of Web content will be described with reference to the sequence diagram of FIG. 10. FIG. 10 is a sequence diagram showing the search processing performed when the search client device 20 shown in FIG. 1 caches extraction rules.

First, the information collection step (S1), extraction rule holding step (S2), text extraction step (S3), and registration step (S4) performed by the search server device 10 follow the same processing procedure as that shown in FIG. 3.

As shown in FIG. 10, upon receiving a search request from the user (S5), the search client device 20 transmits the request to the search server device 10 (information acquisition step: S6). The search request contains information used for the search, such as keywords. The search server device 10 searches the Web content in response to the search request and extracts a specific search index (S7). Then, based on the information about the Web content corresponding to the specific search index, it returns to the search client device 20, as the search response, the URL of the matching Web content, the URL of the extraction rule associated with it, and the update date and time information of the extraction rule (search step: S8).

Upon receiving the search response from the search server device 10, the search client device 20 checks whether the extraction rule associated with the URL of the Web content contained in the search response is cached in the search client device 20 (S18). If it is not, the search client device 20 acquires the corresponding extraction rule using the URL of the extraction rule received from the search server device 10 (S19).
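Steps S18 and S19 amount to a fetch-on-miss cache, which can be sketched as follows. The class name RuleCache, the in-memory dictionary, and the fetch_rule callable (which stands in for retrieving a rule from its URL) are assumptions made for the example.

```python
from typing import Callable, Dict


class RuleCache:
    """Client-side cache of extraction rules keyed by Web content URL."""

    def __init__(self, fetch_rule: Callable[[str], str]) -> None:
        self._fetch_rule = fetch_rule
        self._rules: Dict[str, str] = {}

    def get(self, content_url: str, rule_url: str) -> str:
        if content_url not in self._rules:                         # S18: not cached
            self._rules[content_url] = self._fetch_rule(rule_url)  # S19: acquire
        return self._rules[content_url]
```

A second lookup for the same Web content URL returns the stored rule without another network request.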

Even when the extraction rule for the Web content contained in the search response is already cached in the search client device 20, the search client device 20 acquires the corresponding extraction rule again using the URL of the extraction rule and caches it anew, subject to conditions such as the update date and time information of the extraction rule or the URL of the Web content. This updates the extraction rule cached in the search client device 20. The more specific procedure by which the search client device 20 updates extraction rules based on information such as the update date and time of the extraction rule or the URL of the Web content is described later.

The search client device 20 also uses the URL of the Web content contained in the search response to acquire the Web content from the Web servers 40, a plurality of which exist on the communication network 30 (information acquisition step: S9). The search client device 20 extracts the body text by applying the extraction rule acquired from the URL of the extraction rule to the Web content acquired from the URL of the Web content (latest text extraction step: S10). The search client device 20 then displays, as the search result, either the extracted body text or the body text of the Web content contained in the search response (display step: S11).

Next, the processing procedure by which the search client device 20 updates the extraction rules cached in it, based on information such as the date and time at which an extraction rule was cached or the URL of the Web content, will be described in detail with reference to the flowcharts and search result lists shown in FIGS. 11 to 14.

First, the procedure by which the search client device 20 updates the extraction rule cache based on the update date and time information will be described. FIG. 11 is a flowchart explaining the procedure by which the search client device 20 shown in FIG. 1 updates the extraction rule cache based on the date and time at which the extraction rule was cached, and FIG. 12 is a diagram showing the search result list received from the search server device 10 in the procedure shown in FIG. 11.

As shown in FIG. 11, the search client device 20 issues a search request to the search server device 10 and receives a search result list 24 such as that shown in FIG. 12 (S42). Each entry of the search result list 24 is information about the Web content associated with the ID corresponding to a specific search index. For example, each entry of the search result list 24 contains the ID corresponding to the search index, the URL of the Web content, the link information of the extraction rule, and the rule acquisition date and time. Here, as described above, the rule acquisition date and time is the date and time at which the extraction rule was acquired by the extraction rule holding unit 12, and corresponds to the date and time at which the extraction rule was updated. For example, for ID "1" of the search result list 24, "http://xxx.com/xxx.html" is set as the URL of the Web content, "xxx_com.yml" as the link information of the extraction rule, and "2013-05-10T11:11:11+0900" as the rule acquisition date and time, in association with one another.

Next, when an extraction rule contained in the search result list 24 is cached in the search client device 20, the information acquisition unit 21 determines whether the date and time at which that extraction rule was last cached is newer than the acquisition date and time of the extraction rule (S43). For example, if the extraction rule was last cached at a date and time later than its acquisition date and time, the information acquisition unit 21 determines that the cached date and time is newer than the acquisition date and time. If the extraction rule was last cached at a date and time earlier than its acquisition date and time, it determines that the cached date and time is not newer than the acquisition date and time. This determination need not be made with strict time-of-day precision; it may be made in units of a period such as a day. If the information acquisition unit 21 determines that the cached date and time is not newer than the acquisition date and time (S43; No), it acquires the extraction rule afresh from the communication network 30 based on the URL of the extraction rule contained in the search result list 24 and updates the extraction rule cache (S44). If the extraction rule is not yet cached, the information acquisition unit 21 acquires and caches it regardless of the above determination. Next, the information acquisition unit 21 determines whether the search result list 24 still contains unprocessed Web content (S45). If it does (S45; Yes), the information acquisition unit 21 again determines, for the unprocessed Web content, whether the date and time at which the extraction rule was last cached is newer than the acquisition date and time of the extraction rule (S43). When the information acquisition unit 21 determines that the search result list 24 contains no unprocessed Web content (S45; No) and has finished processing all the search results in the search result list 24, it acquires the Web content using the URLs of the Web content contained in the search result list 24 (S46). The latest text extraction unit 22 then extracts the body text of the Web content acquired by the information acquisition unit 21, using the extraction rules cached in the search client device 20. The display unit 23 then displays that body text (S46).
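The freshness test of step S43 can be sketched as follows. The function names are hypothetical, and the timestamp format string is an assumption chosen to match the example value "2013-05-10T11:11:11+0900". The case where the two timestamps are equal is not addressed in the description; the sketch treats it as "not newer" and refreshes, which is an assumption.

```python
from datetime import datetime
from typing import Optional

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S%z"  # e.g. 2013-05-10T11:11:11+0900


def parse_timestamp(value: str) -> datetime:
    """Parse a rule acquisition timestamp into a timezone-aware datetime."""
    return datetime.strptime(value, TIMESTAMP_FORMAT)


def needs_refresh(cached_at: Optional[datetime],
                  rule_acquired_at: datetime) -> bool:
    """Return True when the client should re-acquire the rule (S44)."""
    if cached_at is None:
        # Not cached yet: always acquire, regardless of the comparison.
        return True
    # S43: refresh unless the cached copy is newer than the server's rule.
    return cached_at <= rule_acquired_at
```

To compare in units of a day rather than exact time, one option would be to compare `cached_at.date()` with `rule_acquired_at.date()` instead.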

Next, the procedure by which the search client device 20 updates the extraction rule cache based on the URL of the Web content will be described. FIG. 13 is a flowchart explaining the procedure by which the search client device shown in FIG. 1 updates the extraction rule cache based on the URL of the Web content, and FIG. 14 is a diagram showing the search result list received from the search server device in the procedure shown in FIG. 13.

As shown in FIG. 13, the search client device 20 issues a search request to the search server device 10 and receives a search result list 25 such as that shown in FIG. 14 (S52). Each entry of the search result list 25 is information about the Web content associated with the ID corresponding to a specific search index. For example, each entry of the search result list 25 contains the ID corresponding to the search index, the URL of the Web content, and the link information of the extraction rule. For example, for ID "1" of the search result list 25, "http://xxx.com/xxx.html" is set as the URL of the Web content and "xxx_com.yml" as the link information of the extraction rule, in association with each other.

Next, the information acquisition unit 21 determines whether the URL of the Web content contained in the search result list 25 is a specific URL designated in advance (S53). If the information acquisition unit 21 determines that the URL of the Web content is a specific URL (S53; Yes), it acquires the extraction rule from the URL of the extraction rule contained in the search result list 25 and updates the extraction rule cache (S54). If the extraction rule is not yet cached, the information acquisition unit 21 acquires and caches it regardless of the above determination. Next, the information acquisition unit 21 determines whether the search result list 25 still contains unprocessed Web content (S55). If it does (S55; Yes), the information acquisition unit 21 again checks the URL of the unprocessed Web content and determines whether it is a specific URL (S53). When the information acquisition unit 21 determines that the search result list 25 contains no unprocessed Web content (S55; No) and has finished processing all the search results in the search result list 25, it acquires the Web content using the URLs of the Web content contained in the search result list 25 (S56). The latest text extraction unit 22 then extracts the body text of the Web content acquired by the information acquisition unit 21, using the extraction rules held in the cache. The display unit 23 then displays that body text (S56).
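The URL-based update of steps S53 to S55 can be sketched as follows. The function name, the representation of each search result as a (Web content URL, rule link) pair, the dictionary used as the cache, and the exact-match test for a "specific URL" are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Set, Tuple


def refresh_rules_by_url(results: Iterable[Tuple[str, str]],
                         specific_urls: Set[str],
                         cache: Dict[str, str],
                         fetch_rule: Callable[[str], str]) -> Dict[str, str]:
    """Update the rule cache for every search result (the S55 loop)."""
    for content_url, rule_link in results:
        is_specific = content_url in specific_urls   # S53
        not_cached = content_url not in cache        # acquire regardless of S53
        if is_specific or not_cached:
            cache[content_url] = fetch_rule(rule_link)  # S54
    return cache
```

Rules for URLs that are neither designated as specific nor missing from the cache are left untouched, so only the frequently changing sites cost an extra request.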

Next, the Web content search program that causes an information processing device (computer) to operate as the search server device 10 and the search client device 20 will be described. The search server device 10 and the search client device 20, each of which is an information processing device having the hardware configuration shown in FIG. 2, are each provided with the Web content search program over a network as a computer data signal superimposed on a carrier wave. Each device can store the Web content search program so provided in memory such as the auxiliary storage device 107 or 207 and execute it. The search server device 10 and the search client device 20 thereby gain access to the Web content search program stored in memory and, by means of that program, can operate as the search server device 10 and the search client device 20 of this embodiment.

The Web content search program according to the embodiment of the present invention may also be provided stored on a recording medium. Examples of the recording medium include a floppy (registered trademark) disk, a CD-ROM, a DVD, a ROM, and a semiconductor memory. In this case, the Web content search program is stored in the memory of the search server device 10 and the search client device 20 using a reading device such as a floppy (registered trademark) disk drive, a CD-ROM drive, or a DVD drive.

According to the Web content search system 1 described above and the Web content search method using it, the search server device 10 acquires Web content from the communication network 30 and extracts the body text of the Web content based on the extraction rule held in association with the URL of the Web content. The extracted body text is registered in the search index in association with information about the Web content, including its URL. Upon receiving from the search client device 20 a search request for searching Web content, the search server device 10 extracts a specific search index from the search index based on the request and returns the information corresponding to that search index to the search client device 20. In doing so, the search server device 10 can, based on the information about the Web content, return to the search client device 20 either the combination of the URL of the Web content corresponding to the specific search index and the specific extraction rule corresponding to that URL, or the body text of the Web content registered in the specific search index. The search client device 20 transmits the search request to the search server device 10 and receives, as the search result, either the combination of the URL of the Web content and the specific extraction rule or the body text of the Web content. With the former, it can use the URL to acquire the Web content from the communication network 30 and extract the body text from the acquired content using the specific extraction rule received from the search server device 10. Therefore, when, for example, the body text received from the search server device 10 is not the latest, the search client device 20 can acquire and display the latest body text of the Web content as needed. When, for example, the body text received from the search server device 10 is the latest, the search client device 20 can display the received body text without acquiring it again. As a result, the body text of the latest Web content can be displayed as the search result.

When the date and time at which the body text of the Web content was registered is not newer than a predetermined date and time, the search server device 10 returns to the search client device 20, as the search result, the combination of the URL of the matching Web content and the extraction rule. The search client device 20 can therefore use the URL of the Web content and the specific extraction rule received from the search server device 10 to extract the body text of the latest Web content from the communication network 30 and display it. When the registration date and time is newer than the predetermined date and time, the search server device 10 returns to the search client device 20 the body text of the matching Web content as the search result. The search client device 20 can therefore display body text of the Web content that was extracted more recently than the predetermined date and time. As a result, the body text of the latest Web content can be displayed as the search result.

Also, when, for example, the URL of the Web content is a specific URL whose content is updated frequently, the search client device 20 can extract the body text of the latest Web content from the communication network 30 and display it.

Also, the search client device 20 can hold and use the extraction rule corresponding to the link information received from the search server device 10. This allows the search client device 20 to acquire the latest Web content using the extraction rule cached in it, without receiving the extraction rule from the search server device 10. As a result, the Web content search processing can be made more efficient.

Also, when the date and time at which the specific extraction rule returned from the search server device 10 was last cached is not new by a predetermined criterion, the search client device 20 can use the link information received from the search server device 10 to acquire the corresponding extraction rule from the communication network 30 and update the extraction rule cached in the search client device 20. This allows the search client device 20 to acquire Web content using the latest extraction rule. As a result, Web content can be extracted appropriately.

また、特定の検索インデックスにより示されるＷｅｂコンテンツのＵＲＬが特定のＵＲＬである場合に、検索クライアント装置２０側において、検索クライアント装置２０内にキャッシュされた抽出ルールを更新することができる。これにより、例えばＷｅｂコンテンツのＵＲＬが、該コンテンツの内容が頻繁に更新されているような特定のＵＲＬである場合に、検索クライアント装置２０側において、通信ネットワーク３０上から取得した最新の抽出ルールを用いてＷｅｂコンテンツの項目を抽出して表示することができる。その結果、Ｗｅｂコンテンツの内容の更新に合わせて適切にＷｅｂコンテンツを抽出できる。 Further, when the URL of the Web content indicated by the specific search index is a specific URL, the search client device 20 can update the extraction rule cached in the search client device 20. Thus, for example, when the URL of the Web content is a specific URL whose content is updated frequently, the search client device 20 can extract and display the items of the Web content using the latest extraction rule acquired from the communication network 30. As a result, the Web content can be extracted appropriately in step with updates to its content.

以上、本発明の好適な実施形態について説明してきたが、本発明は必ずしも上述した実施形態に限定されるものではなく、その要旨を逸脱しない範囲で様々な変更が可能である。 The preferred embodiments of the present invention have been described above. However, the present invention is not necessarily limited to the above-described embodiments, and various modifications can be made without departing from the scope of the present invention.

１…Ｗｅｂコンテンツ検索システム、１０…検索サーバ装置、１１…情報収集部（情報収集手段）、１２…抽出ルール保持部（抽出ルール保持手段）、１３…テキスト抽出部（テキスト抽出手段）、１４…検索インデックス登録部（登録手段）、１５…検索部（検索手段）、２０…検索クライアント装置、２１…情報取得部（情報取得手段）、２２…最新テキスト抽出部（最新テキスト抽出手段）、２３…表示部（表示手段）、３０…通信ネットワーク。 DESCRIPTION OF SYMBOLS: 1 ... Web content search system, 10 ... search server device, 11 ... information collection unit (information collection means), 12 ... extraction rule holding unit (extraction rule holding means), 13 ... text extraction unit (text extraction means), 14 ... search index registration unit (registration means), 15 ... search unit (search means), 20 ... search client device, 21 ... information acquisition unit (information acquisition means), 22 ... latest text extraction unit (latest text extraction means), 23 ... display unit (display means), 30 ... communication network.

Claims

A content search system comprising a search server device and a search client device, wherein
the search server device includes:
information collection means for acquiring content from a communication network;
extraction rule holding means for holding an extraction rule for extracting the body text of the content acquired by the information collection means, in association with storage location information of the content;
text extraction means for extracting the body text of the content acquired by the information collection means, based on the extraction rule held by the extraction rule holding means;
registration means for registering the body text of the content extracted by the text extraction means in a search index, in association with information on the content that includes the storage location information of the content; and
search means for, when a search request for searching for the content is received from the search client device, extracting a specific search index from the search index based on the search request, selecting, based on the information on the content, either a combination of the storage location information of the content corresponding to the specific search index and a specific extraction rule that is extracted from the extraction rule holding means and corresponds to that storage location information, or the body text of the content registered in the specific search index, and returning the selected one to the search client device, and
the search client device includes:
information acquisition means for transmitting the search request to the search means of the search server device, receiving, as a search result, either the combination of the storage location information of the content and the specific extraction rule or the body text of the content returned by the search means, and acquiring the content from the communication network using the received storage location information of the content;
latest text extraction means for extracting the body text of the content in accordance with the specific extraction rule received from the search server device, after the content is acquired by the information acquisition means; and
display means for displaying the body text of the content received by the information acquisition means or the body text of the content extracted by the latest text extraction means.

The content search system according to claim 1, wherein
the information on the content includes registration date and time information on the date and time when the body text of the content was registered in the search index, and
the search means determines, based on the registration date and time information, whether the registration date and time of the body text of the content registered in the specific search index is newer than a predetermined date and time; when the registration date and time is not newer than the predetermined date and time, returns to the search client device the combination of the storage location information of the content corresponding to the specific search index and the specific extraction rule that is extracted from the extraction rule holding means and corresponds to that storage location information; and when the registration date and time is newer than the predetermined date and time, returns to the search client device the body text of the content registered in the specific search index.

The content search system according to claim 1, wherein
the search means determines whether the storage location information of the content corresponding to the specific search index is storage location information of specific content; when it is the storage location information of the specific content, returns to the search client device a combination of the storage location information of the specific content and a specific extraction rule that is extracted from the extraction rule holding means and corresponds to that storage location information; and when it is not the storage location information of the specific content, returns to the search client device the body text of the content registered in the specific search index.

The content search system according to any one of claims 1 to 3, wherein
the search means returns the specific extraction rule as link information, and
the information acquisition means acquires the extraction rule corresponding to the link information from the communication network using the link information returned by the search means, and caches the acquired extraction rule.

The content search system according to claim 4, wherein
the information acquisition means determines whether the date and time at which the specific extraction rule was last cached is new according to a predetermined criterion and, when that date and time is not new according to the predetermined criterion, acquires the extraction rule corresponding to the link information from the communication network using the link information returned by the search means, and caches the acquired extraction rule again.

The content search system according to claim 4, wherein
the information acquisition means determines whether the storage location information of the content corresponding to the specific search index is storage location information of specific content and, when it is the storage location information of the specific content, acquires the extraction rule corresponding to the link information from the communication network using the link information returned by the search means, and caches the acquired extraction rule again.

A content search method for searching for content by means of a search server device and a search client device, the method including,
performed by the search server device:
an information collection step of acquiring content from a communication network;
an extraction rule holding step of holding an extraction rule for extracting the body text of the content acquired in the information collection step, in association with storage location information of the content;
a text extraction step of extracting the body text of the content acquired in the information collection step, based on the extraction rule held in the extraction rule holding step;
a registration step of registering the body text of the content extracted in the text extraction step in a search index, in association with information on the content that includes the storage location information of the content; and
a search step of, when a search request for searching for the content is received from the search client device, extracting a specific search index from the search index based on the search request, selecting, based on the storage location information of the content, either a combination of the storage location information of the content corresponding to the specific search index and a specific extraction rule that is held in the extraction rule holding step and corresponds to that storage location information, or the body text of the content registered in the specific search index, and returning the selected one to the search client device,
and, performed by the search client device:
an information acquisition step of transmitting the search request to the search server device, receiving, as a search result, either the combination of the storage location information of the content and the specific extraction rule or the body text of the content returned in the search step, and acquiring the content from the communication network using the received storage location information of the content;
a latest text extraction step of extracting the body text of the content in accordance with the specific extraction rule received from the search server device, after the content is acquired in the information acquisition step; and
a display step of displaying the body text of the content received in the information acquisition step or the body text of the content extracted in the latest text extraction step.

A content search program for searching for content by means of a search server device and a search client device, the program
causing a computer that operates as the search server device to function as:
information collection means for acquiring content from a communication network;
extraction rule holding means for holding an extraction rule for extracting the body text of the content acquired by the information collection means, in association with storage location information of the content;
text extraction means for extracting the body text of the content acquired by the information collection means, based on the extraction rule held by the extraction rule holding means;
registration means for registering the body text of the content extracted by the text extraction means in a search index, in association with information on the content that includes the storage location information of the content; and
search means for, when a search request for searching for the content is received from the search client device, extracting a specific search index from the search index based on the search request, selecting, based on the information on the content, either a combination of the storage location information of the content corresponding to the specific search index and a specific extraction rule that is extracted from the extraction rule holding means and corresponds to that storage location information, or the body text of the content registered in the specific search index, and returning the selected one to the search client device,
and causing a computer that operates as the search client device to function as:
information acquisition means for transmitting the search request to the search means of the search server device, receiving, as a search result, either the combination of the storage location information of the content and the specific extraction rule or the body text of the content returned by the search means, and acquiring the content from the communication network using the received storage location information of the content;
latest text extraction means for extracting the body text of the content in accordance with the specific extraction rule received from the search server device, after the content is acquired by the information acquisition means; and
display means for displaying the body text of the content received by the information acquisition means or the body text of the content extracted by the latest text extraction means.