JP5226725B2 - Search engine and method for confirming status of processing sequentially performed by a plurality of servers

Info

Publication number: JP5226725B2
Application number: JP2010083026A
Authority: JP
Inventors: 正樹北野
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2010-03-31
Filing date: 2010-03-31
Publication date: 2013-07-03
Anticipated expiration: 2030-03-31
Also published as: JP2011215857A

Description

The present invention relates to a system, a method, and a search engine for confirming the status of processing performed sequentially by a plurality of servers. In particular, it relates to a search engine and a method for identifying a server that is causing a problem without interrupting the processing performed sequentially by the plurality of servers.

Conventionally, Internet search sites provide a service that, in response to a search request such as a keyword entered by a user, returns the search results to the user as a list. The operator of such a search site usually collects data such as documents and images from Web pages on the Internet using a search engine equipped with a program called a crawler or a robot.

Techniques have been provided that allow a search engine to keep up with Web pages that are updated one after another all over the world. For example, according to the technique described in Patent Document 1, information about a specific site to be accessed, information indicating the date and time at which information collection for that site is to start, and the like are stored, and information collection is started when the current time is determined to match the collection start time, so that the Web crawler collects Web pages on the Internet efficiently.

Patent Document 1: JP 2004-318746 A

A search engine is provided with a data analysis unit that performs various kinds of processing on the data collected by such a Web crawler, such as adding metadata and checking whether the data appears on a predetermined blacklist. The data processed by the data analysis unit is indexed and stored in a database. A search result list corresponding to a search request from a user is then created with reference to that database and transmitted to the user's terminal.

In the search result list transmitted to the user's terminal, the Web pages that were hit are displayed in a predetermined order (ranking) based on their priority (score) for the search keyword. Because this order greatly affects how often a Web page is accessed, it is very important to the administrator of the Web page. The operator of the search site therefore pays close attention to the order within the search results, and when the order changes abnormally, the cause must be identified and dealt with quickly.

Changes in the order within the search result list arise from many causes, and the processing performed by the data analysis unit before index creation is known to be one of them. The data analysis unit consists of a plurality of servers, and these servers sequentially process the data collected by the Web crawler. Consequently, when the order within the search result list changes abnormally, it is difficult to confirm at which stage of the data analysis unit the abnormal change occurred.
Further, in recent years, services based on search engines have been required to be provided continuously without stopping. When confirming an abnormality, it is therefore desirable to identify the problematic server without interrupting the processing of the data analysis unit, that is, the processing performed sequentially by the plurality of servers.

The present invention has been made in view of these problems, and relates to a search engine and a method capable of confirming the status of processing performed sequentially by a plurality of servers and of identifying the server causing a problem without interrupting that processing.

The present invention provides the following solutions.

(1) A search engine comprising: a Web crawler that crawls Web pages and collects crawl data; a data analysis unit that creates, from the crawl data collected by the Web crawler, data used for a search index; an indexer that creates a search index based on the data created by the data analysis unit; and a search server that responds to a search request from a user using the search index created by the indexer, wherein the data analysis unit comprises a plurality of servers that sequentially process the crawl data to create the data used for the search index, and an aggregation server that aggregates the processing results of the plurality of servers; each of the plurality of servers comprises transaction log storage means that, each time processing is executed, stores a transaction log of that processing in association with the URL of the Web page corresponding to the crawl data, and log extraction means that, in response to receiving a status confirmation request specifying the URL, extracts the transaction log associated with that URL from the transaction log storage means and sends it to the aggregation server; and the aggregation server comprises log aggregation means that aggregates the plurality of received transaction logs.

According to the search engine of (1), when the plurality of servers process the data identified by a URL, a transaction log describing that processing is stored in the transaction log storage means. When a status confirmation request specifying a URL is received, the transaction logs associated with that URL are extracted and aggregated. The processing that each of the servers executed on the data identified by the URL can thus be understood from the extracted transaction logs, and the status of the processing performed sequentially by the servers can be confirmed from the aggregated transaction logs.
Further, because the transaction logs are extracted only when a status confirmation request is received, the (main) processing performed sequentially by the plurality of servers is not affected.
It is therefore possible to understand what each of the servers did without interrupting the processing they perform sequentially, and to identify the server causing the problem.
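The arrangement in (1) can be sketched as follows. This is a minimal illustration only: the class name `AnalysisServer`, the method names `record` and `extract`, the function `aggregate`, and the shape of a log entry are assumptions made for the example and do not appear in the specification.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class AnalysisServer:
    """One stage of the sequential pipeline (hypothetical model)."""

    server_id: str
    # Transaction log storage: URL -> list of log entries recorded for it.
    logs: dict = field(default_factory=lambda: defaultdict(list))

    def record(self, url: str, entry: dict) -> None:
        """Store a transaction log entry in association with the page URL."""
        self.logs[url].append(entry)

    def extract(self, url: str) -> list:
        """Log extraction: return copies of the entries stored for the URL.

        This only reads the store, so the main processing is not touched.
        """
        return [dict(e) for e in self.logs.get(url, [])]


def aggregate(servers: list, url: str) -> dict:
    """Log aggregation: collect every server's entries for one URL."""
    return {s.server_id: s.extract(url) for s in servers}
```

Calling `aggregate` with the list of servers and a URL plays the role of a status confirmation request: the result groups the stored entries by server, which is what lets an administrator see what each stage did to the page.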

(2) The search engine according to (1), wherein the status confirmation request includes, in addition to the specification of the URL, a specification of server identification information identifying the plurality of servers, and the log extraction means extracts the transaction log associated with the URL included in the status confirmation request from the transaction log storage means provided in the server identified by the server identification information.

According to the search engine of (2), the transaction log is extracted from the specific server, among the plurality of servers, that is identified by the server identification information included in the status confirmation request. When an abnormality occurs in the processing performed sequentially by the plurality of servers, it is thus possible to check only the processing of the server presumed to have caused it. This limits the number of servers in the system that must perform additional processing, keeping the processing load down while allowing the cause to be identified quickly.
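A status confirmation request that optionally narrows the query to particular servers might look like the sketch below. The names `StatusRequest`, `server_ids`, and `collect_logs`, and the dictionary layout of the stores, are assumptions for illustration; the specification only states that the request carries a URL and may carry server identification information.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StatusRequest:
    """A status confirmation request (hypothetical structure)."""

    url: str
    # None means "ask every server"; otherwise only the listed servers.
    server_ids: Optional[frozenset] = None


def collect_logs(stores: dict, request: StatusRequest) -> dict:
    """Extract logs for request.url from the selected servers only.

    `stores` maps a server identifier to that server's log store,
    which in turn maps a URL to its list of log entries.
    """
    selected = [
        sid for sid in stores
        if request.server_ids is None or sid in request.server_ids
    ]
    return {sid: list(stores[sid].get(request.url, [])) for sid in selected}
```

Servers that are not named in the request are never consulted, which is the source of the load reduction described for (2).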

(3) A search engine comprising: a Web crawler that crawls Web pages and collects crawl data; a data analysis unit that creates, from the crawl data collected by the Web crawler, data used for a search index; an indexer that creates a search index based on the data created by the data analysis unit; and a search server that responds to a search request from a user using the search index created by the indexer, wherein the data analysis unit comprises a plurality of servers that sequentially process the crawl data to create the data used for the search index, and an aggregation server that aggregates the processing results of the plurality of servers; each of the plurality of servers comprises transaction log storage means that, each time processing is executed, stores a transaction log of that processing in association with the URL of the Web page corresponding to the crawl data, and log extraction means that extracts, at a predetermined cycle, the transaction logs associated with a specific URL from among the transaction logs stored in each of the transaction log storage means and sends them to the aggregation server; and the aggregation server comprises log aggregation means that aggregates the plurality of extracted transaction logs in association with the URL and with server identification information identifying the plurality of servers.

According to the search engine of (3), the status of the processing performed sequentially by the plurality of servers can be confirmed by checking the transaction logs associated with a specific URL, which are extracted at a predetermined cycle.
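The periodic variant in (3) can be sketched as a polling loop. The function name `poll_logs`, its parameters, and the use of a `(url, server_id)` pair as the aggregation key are assumptions for illustration; the specification does not state how the cycle is scheduled.

```python
import time


def poll_logs(stores, watched_urls, cycles, interval_s, sleep=time.sleep):
    """Extract logs for the watched URLs once per cycle.

    `stores` maps a server identifier to {url: [log entries]}.
    Each snapshot is keyed by (url, server_id), mirroring aggregation
    in association with both the URL and the server identification.
    The `sleep` argument is injectable so the loop can be tested.
    """
    snapshots = []
    for cycle in range(cycles):
        snapshot = {}
        for server_id, store in stores.items():
            for url in watched_urls:
                snapshot[(url, server_id)] = list(store.get(url, []))
        snapshots.append(snapshot)
        if cycle < cycles - 1:
            sleep(interval_s)  # wait for the next cycle
    return snapshots
```

Comparing consecutive snapshots for the same key would show when a given server's handling of a watched URL changed.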

(4) A search method in which a Web crawler crawls Web pages and collects crawl data, a data analysis unit creates, from the crawl data collected by the Web crawler, data used for a search index, an indexer creates a search index based on the data created by the data analysis unit, and a search server responds to a search request from a user using the search index created by the indexer, wherein, in the data analysis unit, each of a plurality of servers sequentially processes the crawl data to create the data used for the search index and an aggregation server aggregates the processing results of the plurality of servers, the method comprising: a step in which, in each of the plurality of servers, transaction log storage means stores, each time processing is executed, a transaction log of that processing in association with the URL of the Web page corresponding to the crawl data; a step in which log extraction means, in response to receiving a status confirmation request specifying the URL, extracts the transaction log associated with that URL from the transaction log storage means and sends it to the aggregation server; and a step in which log aggregation means in the aggregation server aggregates the plurality of received transaction logs.

According to the method of (4), the same effects as those of the system of (1) are obtained. In particular, it is possible to confirm at which stage of the data analysis unit, which consists of a plurality of servers, an abnormal change in the order within the search result list occurred.

According to the present invention, it is possible to confirm the status of processing performed sequentially by a plurality of servers and to identify the server causing a problem without interrupting that processing.

A diagram showing the overall configuration of the search engine of the present invention.
A diagram showing the configuration of the data analysis unit of the present invention.
A diagram showing a transaction log, which records the processing performed by the data analysis unit of the present invention.
A flowchart showing the main processing performed by the data analysis unit of the present invention.
A flowchart showing the log aggregation processing performed by the data analysis unit of the present invention.
A flowchart showing the log aggregation processing performed by the data analysis unit of a modified embodiment.

(Embodiment)
Hereinafter, the system of the present invention, in which a plurality of servers perform processing sequentially, will be described using a search engine 1 that includes a data analysis unit 11 as a preferred example of such a system. Note that the present invention is applicable to any system in which a plurality of servers perform processing sequentially, and is not limited to the data analysis unit 11 of the search engine 1.

[Configuration of the search engine]
FIG. 1 is a diagram showing the overall configuration of the search engine 1 of the present invention.
The search engine 1 includes a Web crawler 10, a data analysis unit 11, an indexer 12, and a search server 13.

The Web crawler 10 is a program that periodically crawls Web pages on the Internet and automatically collects the data of various Web pages by following links in HTML (HyperText Markup Language). The Web page data collected by the Web crawler 10 (hereinafter, "crawl data") is stored in a crawl database (not shown; hereinafter, "database" is abbreviated as "DB") in association with the URL (Uniform Resource Locator) of the Web page. In the present embodiment, the URL is used as the "predetermined key", and the crawl data is used as the "data identified by the predetermined key".

The data analysis unit 11 consists of a plurality of analysis servers 20A ... 20N. These servers sequentially process the crawl data identified by a URL, thereby performing a series of preprocessing steps for generating the search index.
As preprocessing, the data analysis unit 11 sequentially performs, for example, HTML tag removal, metadata addition, and score determination on the crawl data. "HTML tag removal" refers, for example, to extracting the title of the Web page defined in the title tag. "Metadata addition" refers, for example, to extracting the path length, the keywords of the Web page, the HTML size, and the like. "Score determination" refers, for example, to calculating various scores such as an adult score and an NG score and determining whether each calculated score exceeds a threshold. When the score determination finds that a threshold is exceeded, processing corresponding to that score is performed; for example, when the NG score exceeds its threshold, the page is not displayed as a search result.
The crawl DB is updated each time the data analysis unit 11 (analysis server 20) performs processing.
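The three example preprocessing steps can be sketched as a small staged pipeline. Everything concrete here is an assumption for illustration: the word list, the threshold of 0.5, the scoring formula, and the function names are invented, since the specification does not say how the scores are calculated.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

NG_WORDS = ("spam",)   # placeholder vocabulary, not from the specification
NG_THRESHOLD = 0.5     # placeholder threshold, not from the specification


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(record):
    """'HTML tag removal' example: pull the page title out of the markup."""
    parser = TitleParser()
    parser.feed(record["html"])
    return {"title": parser.title.strip()}


def add_metadata(record):
    """'Metadata addition' example: URL path length and HTML size."""
    return {
        "path_length": len(urlparse(record["url"]).path),
        "html_size": len(record["html"]),
    }


def judge_score(record):
    """'Score determination' example with a toy NG score."""
    text = record["html"].lower()
    hits = sum(text.count(word) for word in NG_WORDS)
    ng_score = min(1.0, hits / 4)
    return {"ng_score": ng_score, "hidden": ng_score > NG_THRESHOLD}


def run_pipeline(record, stages=(extract_title, add_metadata, judge_score)):
    """Apply the stages in order, logging what each one wrote."""
    record = dict(record)
    log = []
    for stage in stages:
        changes = stage(record)
        record.update(changes)
        log.append((stage.__name__, changes))
    return record, log
```

The per-stage `log` returned by `run_pipeline` corresponds loosely to the transaction logs: it records which values each stage wrote for the page.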

The indexer 12 analyzes the data in the crawl DB updated by the data analysis unit 11 and creates a search index 121. The indexer 12 can, for example, create a search index 121 ranked on the basis of inbound link information, or a search index 121 ranked on the basis of the relationship between the keywords and the title of a Web page.

The search server 13 creates a search result list for a search keyword received from the user terminal 2 by referring to the search index 121, and returns the search result list to the user terminal 2.

In this way, the search engine 1 is configured to respond to search requests from the user terminal 2 using the search index 121 that the Web crawler 10, the data analysis unit 11, and the indexer 12 create for search processing.

[Configuration of the data analysis unit]
Next, the specific configuration of the data analysis unit 11, which is a preferred embodiment of the present invention, will be described. FIG. 2 is a block diagram showing the configuration of the data analysis unit 11.

The data analysis unit 11 includes a plurality of analysis servers 20A, 20B ... 20N that sequentially process the crawl data identified by a URL, and an aggregation server 30 that confirms the status of the processing in the analysis servers 20A, 20B ... 20N. In the present embodiment, the crawl data is processed in the order of analysis server 20A, analysis server 20B ... analysis server 20N.

The analysis server 20A includes main processing means 21A, transaction log storage means 22A, and log extraction means 23A.
The analysis servers 20B ... 20N have the same configuration as the analysis server 20A. That is, the analysis servers 20B ... 20N include main processing means 21B ... 21N, transaction log storage means 22B ... 22N, and log extraction means 23B ... 23N. Hereinafter, when these need not be distinguished, they are simply referred to as "analysis server 20", "main processing means 21", "transaction log storage means 22", and "log extraction means 23".

The main processing means 21 executes various kinds of processing on the crawl data and updates the crawl DB. For example, the main processing means 21 performs HTML tag removal, metadata addition, score determination, and the like, and enters the resulting values into the crawl DB. The content of the processing performed by the main processing means 21 (the transaction log) is transmitted to the next analysis server 20 in the sequence.

The transaction log storage means 22 stores the transaction log received from the preceding analysis server 20 in the sequence. As a result, each time the main processing means 21 of an analysis server 20 processes the crawl data, the content of that processing is stored in a transaction log storage means 22 in association with the URL.
The transaction log stored in the transaction log storage means 22 is read by the main processing means 21, which then executes its processing. Reading the transaction log stored in the transaction log storage means 22 and the subsequent processing by the main processing means 21 together constitute the main processing of the data analysis unit 11 (the series of preprocessing steps for generating the search index).

In the present embodiment, the transaction log storage means 22 stores the content of the processing performed by the preceding analysis server 20, but each server may instead store the content of its own processing in its own transaction log storage means 22 after executing that processing. That is, in the present embodiment, the content of the processing performed by the main processing means 21A of the analysis server 20A is stored in the transaction log storage means 22B of the analysis server 20B, which is next in the sequence; alternatively, the content of the processing performed by the main processing means 21A of the analysis server 20A may be stored in the transaction log storage means 22A of the analysis server 20A.

When the log extraction means 23 receives a status confirmation request specifying a URL from the aggregation server 30, it extracts from the transaction log storage means 22 the transaction log associated with the crawl data identified by that URL and transmits it to the aggregation server 30.

The aggregation server 30 includes log request means 31 and log aggregation means 32.

The log request means 31 transmits a status confirmation request to the analysis servers 20 and receives from them the transaction logs corresponding to that request. In addition to the URL, the status confirmation request may include server identification information (for example, an IP address) indicating which of the analysis servers 20A ... 20N is meant. Including server identification information in the status confirmation request makes it possible to obtain transaction logs from only the chosen analysis servers 20 among the plurality of analysis servers 20A ... 20N.

The log aggregation means 32 aggregates the acquired transaction logs so that an administrator can confirm the status of the processing applied to the crawl data identified by the URL.

FIG. 3 shows transaction logs for the crawl data identified by the URL "http://www.abcdefg.co.jp/index.html". FIG. 3(1) shows the transaction log of the analysis server 20A (stored in the transaction log storage means 22B of the analysis server 20B), and FIG. 3(2) shows the transaction log of the analysis server 20B (stored in the transaction log storage means 22C of the analysis server 20C).
A transaction log indicates the updates made to the crawl DB and specifies the values entered for predetermined items. For example, referring to FIG. 3(1), process 200A means entering the value "http://www.abcdefg.co.jp/index.html" for the "URL" item of the crawl DB.

The log aggregation means 32 extracts, from the acquired transaction logs, the content of the processing performed by each analysis server 20. For example, from the transaction log of the analysis server 20A and the transaction log of the analysis server 20B, it extracts the process 200 performed by the analysis server 20A and the process 201 performed by the analysis server 20B.

From the processing of each analysis server 20 extracted by the log aggregation means 32, the administrator can confirm the status of the processing applied to the crawl data identified by the URL. For example, if an abnormality occurs in which the Web page "http://www.abcdefg.co.jp/index.html" is not displayed as a search result by the search engine 1, the administrator can see from process 201A of the analysis server 20B that the likely cause is that the NG score, which indicates that a page is not to be displayed as a search result, was calculated to be equal to or greater than the predetermined value.
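The kind of check the administrator performs on the aggregated logs can be sketched as a simple search over log entries. The entry layout (`item` and `value` keys), the item name `ng_score`, and the function name `find_suspects` are assumptions for illustration; the actual format of the transaction log in FIG. 3 is not reproduced in this text.

```python
def find_suspects(aggregated, item, predicate):
    """Return the servers whose logs wrote `item` with a matching value.

    `aggregated` maps a server identifier to its list of log entries
    for one URL; each entry is a dict with "item" and "value" keys.
    """
    suspects = []
    for server_id, entries in aggregated.items():
        for entry in entries:
            if entry["item"] == item and predicate(entry["value"]):
                suspects.append(server_id)
                break  # one matching entry is enough for this server
    return suspects
```

For instance, asking which servers wrote an NG score at or above a chosen threshold narrows the investigation to the stage that caused the page to be hidden.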

As described above, in the data analysis unit 11 according to the present embodiment, the log extraction means 23 extracts transaction logs from the transaction log storage means 22 independently of the main processing means 21. The status of the processing executed by each of the analysis servers 20 can therefore be confirmed without affecting the main processing of the data analysis unit 11 (the series of preprocessing steps for generating the search index). That is, the administrator can understand the content of the processing performed sequentially by the analysis servers 20 without interrupting it, and can identify the analysis server 20 causing the problem.

The hardware of the analysis servers 20 and the aggregation server 30 can be implemented with general-purpose computers. A general-purpose computer includes, for example, a central processing unit (CPU) as a control unit; memory (RAM, ROM), a hard disk (HDD), and optical discs (CD, DVD, etc.) as a storage unit; various wired and wireless LAN devices as network communication devices; various displays such as liquid crystal displays and plasma displays as display devices; and a keyboard and a pointing device (mouse, trackball, etc.) as input devices, as appropriate, all connected by a bus line. In such a computer, the CPU controls the analysis server 20 or the aggregation server 30 as a whole and reads and executes various programs as appropriate, thereby cooperating with the hardware described above to realize the various functions according to the present invention.

[Processing of the data analysis unit]
Next, the processing of the data analysis unit 11 will be described. FIG. 4 is a flowchart of the main processing performed by each analysis server 20, and FIG. 5 is a flowchart of the log aggregation processing for confirming the status of the processing performed by each analysis server 20.

The main processing will be described with reference to FIG. 4.
S1: The control unit of the analysis server 20 receives a transaction log indicating the content of the processing executed by the main processing means 21 of the preceding analysis server 20 in the sequence.

S2: Subsequently, the control unit of the analysis server 20 stores the received transaction log in the storage unit (transaction log storage means 22).

S3: Subsequently, the control unit (main processing means 21) of the analysis server 20 reads the transaction log stored in the storage unit (transaction log storage means 22) and performs, on the crawl data identified by the URL associated with that transaction log, a series of preprocessing steps for generating a search index.

S4: Subsequently, the control unit of the analysis server 20 transmits the content of the processing performed by the main processing means 21 (a transaction log) to the next analysis server 20 in the sequence, and the main processing performed by that analysis server 20 ends.
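Steps S1 to S4 can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation described in this specification: the names `TransactionLog`, `AnalysisServer`, `handle`, `process`, and `run_pipeline` are hypothetical, the log store is an in-memory dictionary, and passing the log to the next server is modeled as a function return rather than a network transfer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TransactionLog:
    url: str        # URL identifying the crawl data
    server_id: str  # server that produced this log (hypothetical field)
    detail: str     # description of the processing that was performed


@dataclass
class AnalysisServer:
    server_id: str
    process: Callable[[str], str]  # hypothetical preprocessing step for a URL
    log_store: Dict[str, List[TransactionLog]] = field(default_factory=dict)

    def handle(self, incoming: TransactionLog) -> TransactionLog:
        # S1, S2: receive the preceding server's log and store it by URL.
        self.log_store.setdefault(incoming.url, []).append(incoming)
        # S3: read the stored log back and preprocess the crawl data it names.
        stored = self.log_store[incoming.url][-1]
        detail = self.process(stored.url)
        # S4: produce the log that is passed on to the next server.
        return TransactionLog(stored.url, self.server_id, detail)


def run_pipeline(servers: List[AnalysisServer], url: str) -> TransactionLog:
    # The initial log stands in for whatever the first server receives.
    log = TransactionLog(url, "crawler", "fetched")
    for server in servers:
        log = server.handle(log)
    return log
```

Because every server keeps the log it received keyed by URL, the stores can later be queried per URL without touching the `handle` path.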

Next, the log aggregation processing will be described with reference to FIG. 5. The log aggregation processing is performed in parallel with the main processing of FIG. 4.

S11: The control unit (log request means 31) of the aggregation server 30 transmits a status confirmation request to the analysis servers 20. The status confirmation request includes at least the URL identifying the crawl data whose processing state is to be confirmed. When it is possible to guess which of the plurality of analysis servers 20 has experienced an abnormality (for example, immediately after the version of a specific analysis server has been changed), the status confirmation request preferably also includes server identification information, such as an IP address, identifying that analysis server 20.

S12: Subsequently, upon receiving the status confirmation request, the control unit (log extraction means 23) of the analysis server 20 extracts, from the storage unit (transaction log storage means 22), the transaction log associated with the URL included in the status confirmation request.
If the status confirmation request includes server identification information, only the log extraction means 23 of the analysis server 20 identified by that server identification information extracts the transaction log. Alternatively, in that case the status confirmation request may be transmitted only to the analysis server 20 identified by the server identification information, and not to the other analysis servers 20 at all.

S13: Subsequently, the control unit (log extraction means 23) of the analysis server 20 transmits the extracted transaction log to the aggregation server 30.

S14: Subsequently, the control unit (log aggregation means 32) of the aggregation server 30 aggregates the received transaction logs and extracts the content of the processing performed by each analysis server 20. The log aggregation processing then ends.
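The request and response flow of S11 to S14, including the optional narrowing by server identification information, can be sketched as follows. This is a simplified illustration, not the method as implemented: the function names `extract_logs` and `confirm_status` are hypothetical, each server's transaction log store is assumed to be a dictionary from URL to a list of log strings, and network transmission is replaced by direct function calls.

```python
from typing import Dict, List, Optional

# Hypothetical shape of one server's store: URL -> list of log entries.
LogStore = Dict[str, List[str]]


def extract_logs(store: LogStore, url: str) -> List[str]:
    # S12: extract only the entries associated with the requested URL.
    return list(store.get(url, []))


def confirm_status(servers: Dict[str, LogStore], url: str,
                   server_id: Optional[str] = None) -> Dict[str, List[str]]:
    # S11: the request always carries a URL; when a server identifier is
    # also given, only that server is asked.
    targets = [server_id] if server_id is not None else list(servers)
    aggregated: Dict[str, List[str]] = {}
    for sid in targets:
        # S12, S13: each targeted server extracts and returns its logs.
        aggregated[sid] = extract_logs(servers[sid], url)
    # S14: the combined view shows what each server did for this URL.
    return aggregated
```

Calling `confirm_status(servers, url)` queries every server, while `confirm_status(servers, url, "10.0.0.2")` touches only the suspected one, which corresponds to the load reduction discussed for requests that carry server identification information.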

As described above, the data analysis unit 11 according to the present embodiment can identify, via the transaction logs, the content of the processing that each of the plurality of analysis servers 20 executed on the crawl data identified by a URL. As a result, an administrator can confirm the state of the processing sequentially performed by the plurality of analysis servers 20. In particular, because the log aggregation processing is performed in parallel with the main processing, the state of the processing sequentially performed by the plurality of analysis servers 20 can be confirmed without stopping the flow of the main processing.

Here, the log extraction means 23 extracts a transaction log only when it receives a status confirmation request from the log request means 31, and it does so independently of the main processing means 21. The state of the processing executed by each of the plurality of analysis servers 20 can therefore be confirmed without affecting the main processing of the data analysis unit 11.

In addition, by including in the status confirmation request server identification information identifying an individual analysis server 20, it is possible, when an abnormality occurs, to check only the processing content of the analysis server 20 presumed to have performed the processing that caused it. This limits the number of analysis servers 20 that have to perform additional processing, so the cause can be identified quickly while the processing load on the system as a whole is kept down.

(Modification)
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. The effects described in the embodiment are merely a list of the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiment.

In the above embodiment, the log extraction means 23 extracts a transaction log when it receives a status confirmation request. However, the present invention is not limited to this; as shown in FIG. 6, the transaction log may instead be extracted at an arbitrarily set extraction timing. Log aggregation processing A according to this modification will be described with reference to FIG. 6.

S21: The control unit (log extraction means 23) of the analysis server 20 determines whether the extraction timing has been reached. The extraction timing can be set arbitrarily and may be, for example, a regular interval such as every 5 minutes. If the extraction timing has been reached in S21, the processing proceeds to S22; otherwise, S21 is repeated until the extraction timing is reached.

S22: Subsequently, the control unit (log extraction means 23) of the analysis server 20 extracts the transaction log from the storage unit (transaction log storage means 22).

S23: Subsequently, the control unit (log extraction means 23) of the analysis server 20 transmits the extracted transaction log to the aggregation server 30.

S24: Subsequently, the control unit (log aggregation means 32) of the aggregation server 30 aggregates the received transaction log in association with the URL associated with that transaction log and with server identification information identifying the type of the analysis server 20 from which it was received.

This makes it possible to monitor the state of the processing for a specified URL within the processing sequentially performed by the plurality of analysis servers 20, by checking, at any preset timing, the transaction logs aggregated in the aggregation server 30 with the specified URL as a key.
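Steps S21 to S24 can be illustrated with the following Python sketch, which is an assumption-laden simplification rather than the disclosed implementation. `PeriodicExtractor`, `Aggregator`, `due`, `extract`, `receive`, and `status` are hypothetical names; the current time is passed in explicitly so the timing check is easy to follow; and the cursor that prevents the same entry from being extracted twice is an assumption, since the text does not say how repeated extraction is handled.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

EXTRACTION_INTERVAL = 300.0  # seconds; the text gives 5 minutes as an example

LogEntry = Tuple[str, str]  # (URL, description of the processing)


class PeriodicExtractor:
    def __init__(self, server_id: str, store: List[LogEntry],
                 interval: float = EXTRACTION_INTERVAL) -> None:
        self.server_id = server_id
        self.store = store
        self.interval = interval
        self.last_run: Optional[float] = None
        self.cursor = 0  # assumption: entries before the cursor were already sent

    def due(self, now: float) -> bool:
        # S21: has the extraction timing been reached?
        return self.last_run is None or now - self.last_run >= self.interval

    def extract(self, now: float) -> List[LogEntry]:
        # S22: extract the entries stored since the previous extraction.
        batch = self.store[self.cursor:]
        self.cursor = len(self.store)
        self.last_run = now
        return batch


class Aggregator:
    def __init__(self) -> None:
        self.by_key: Dict[Tuple[str, str], List[str]] = defaultdict(list)

    def receive(self, server_id: str, batch: List[LogEntry]) -> None:
        # S23, S24: aggregate each entry under (URL, server identifier).
        for url, detail in batch:
            self.by_key[(url, server_id)].append(detail)

    def status(self, url: str) -> Dict[str, List[str]]:
        # Look up everything aggregated for one URL, grouped by server.
        return {sid: d for (u, sid), d in self.by_key.items() if u == url}
```

Keying the aggregate on the pair of URL and server identifier is what lets an administrator ask for one URL and see, server by server, how far its processing has progressed.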

In FIG. 6, the timing at which the log extraction means 23 extracts the transaction log is set arbitrarily, but the timing at which the log extraction means 23 transmits the transaction log to the aggregation server 30 may be set arbitrarily instead. Specifically, the log extraction means 23 may extract a transaction log each time one is stored in the transaction log storage means 22 and then transmit the extracted transaction logs to the aggregation server 30 at a predetermined cycle (for example, every 5 minutes).
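The alternative just described, extracting on every store and transmitting on a fixed cycle, could look roughly like the sketch below. It rests on assumptions that are not in the specification: the class name `BufferedSender`, its methods `on_log_stored` and `tick`, the in-memory buffer, and the `send` callback standing in for transmission to the aggregation server 30 are all hypothetical.

```python
from typing import Callable, List, Tuple

LogEntry = Tuple[str, str]  # (URL, description of the processing)


class BufferedSender:
    def __init__(self, send: Callable[[List[LogEntry]], None],
                 interval: float = 300.0) -> None:
        self.send = send          # hypothetical stand-in for the transfer
        self.interval = interval  # predetermined cycle, e.g. 5 minutes
        self.buffer: List[LogEntry] = []
        self.last_flush = 0.0

    def on_log_stored(self, url: str, detail: str) -> None:
        # Extract immediately, each time a transaction log is stored.
        self.buffer.append((url, detail))

    def tick(self, now: float) -> bool:
        # Transmit only when the predetermined cycle has elapsed.
        if now - self.last_flush < self.interval:
            return False
        self.last_flush = now
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()
        return True
```

Compared with extracting on a timer, this variant spreads the extraction work across the moments when logs are written and leaves only the transfer to the periodic step.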

Description of symbols
1 Search engine
10 Web crawler
11 Data analysis unit
12 Indexer
13 Search server
20 Analysis server
21 Main processing means
22 Transaction log storage means
23 Log extraction means
30 Aggregation server
31 Log request means
32 Log aggregation means

Claims

A search engine comprising: a Web crawler that crawls Web pages and collects crawl data; a data analysis unit that creates data used for a search index from the crawl data collected by the Web crawler; an indexer that creates the search index based on the data created by the data analysis unit; and a search server that responds to a search request from a user using the search index created by the indexer,
wherein the data analysis unit includes a plurality of servers that sequentially process the crawl data to create the data used for the search index, and an aggregation server that aggregates the processing results of the plurality of servers,
each of the plurality of servers comprises:
transaction log storage means for storing, each time processing is executed, a transaction log of the processing in association with the URL of the Web page corresponding to the crawl data; and
log extraction means for, in response to receiving a status confirmation request designating the URL, extracting the transaction log associated with the URL from the transaction log storage means and sending it to the aggregation server,
and
the aggregation server comprises log aggregation means for aggregating the plurality of received transaction logs.

The search engine according to claim 1, wherein the status confirmation request includes, in addition to the designation of the URL, a designation of server identification information for identifying the plurality of servers, and
the log extraction means extracts the transaction log associated with the key included in the status confirmation request from the transaction log storage means provided in the server identified by the server identification information.

A search engine comprising: a Web crawler that crawls Web pages and collects crawl data; a data analysis unit that creates data used for a search index from the crawl data collected by the Web crawler; an indexer that creates the search index based on the data created by the data analysis unit; and a search server that responds to a search request from a user using the search index created by the indexer,
wherein the data analysis unit includes a plurality of servers that sequentially process the crawl data to create the data used for the search index, and an aggregation server that aggregates the processing results of the plurality of servers,
each of the plurality of servers comprises:
transaction log storage means for storing, each time processing is executed, a transaction log of the processing in association with the URL of the Web page corresponding to the crawl data; and
log extraction means for extracting, at a predetermined cycle, the transaction log associated with a specific URL from the transaction logs stored in the transaction log storage means, and sending it to the aggregation server,
and
the aggregation server comprises log aggregation means for aggregating the extracted transaction logs in association with the specific URL and with server identification information for identifying the plurality of servers.

A search method in which a Web crawler crawls Web pages and collects crawl data, a data analysis unit creates data used for a search index from the crawl data collected by the Web crawler, an indexer creates the search index based on the data created by the data analysis unit, and a search server responds to a search request from a user using the search index created by the indexer,
wherein the data analysis unit sequentially processes the crawl data in each of a plurality of servers to create the data used for the search index, and aggregates the processing results of the plurality of servers in an aggregation server,
the method comprising, in each of the plurality of servers:
a step in which transaction log storage means stores, each time processing is executed, a transaction log of the processing in association with the URL of the Web page corresponding to the crawl data; and
a step in which log extraction means, in response to receiving a status confirmation request designating the URL, extracts the transaction log associated with the URL from the transaction log storage means and sends it to the aggregation server;
and a step in which log aggregation means in the aggregation server aggregates the plurality of received transaction logs.