JP2013235515A

JP2013235515A - Data distribution management system, apparatus, method and program

Info

Publication number: JP2013235515A
Application number: JP2012108849A
Authority: JP
Inventors: Kimihiro Mizutani (水谷 后宏); Osamu Akashi (明石 修); Kensuke Fukuda (福田 健介)
Original assignee: Nippon Telegraph and Telephone Corp; Research Organization of Information and Systems
Current assignee: Nippon Telegraph and Telephone Corp; Research Organization of Information and Systems
Priority date: 2012-05-10
Filing date: 2012-05-10
Publication date: 2013-11-21
Anticipated expiration: 2032-05-10
Also published as: JP5818263B2

Abstract

PROBLEM TO BE SOLVED: To shorten analysis time and enable analysis of continuous data on a structured overlay. SOLUTION: In a system having a plurality of nodes on a structured overlay network, a structured-overlay management module controls the data held by the nodes so that the load is uniform among the nodes, and structures the connections between the nodes into a balanced tree in order to analyze the equalized data. A MapReduce processing management module accumulates data transmitted to the overlay network, manages an analysis scheme for analyzing the accumulated data, transmits the analysis result of the analysis scheme to other nodes, and combines the analysis results.

Description

The present invention relates to a data distribution management system, apparatus, method, and program. In particular, it relates to a data distribution management system, apparatus, method, and program that, on a structured overlay network that manages a plurality of metadata (action histories, RFID (Radio Frequency Identification) data, sensor information) in a distributed manner, manage continuous data such as time-series data efficiently and maximize the parallelism of the analysis of that continuous data.

In recent years, analyzing the vast amounts of web and sensor information known as big data has made it possible to personalize search advertisements from web browsing histories and to predict traffic congestion by aggregating and analyzing automobile sensor information (see, for example, Non-Patent Document 1). However, big data is so large that analyzing it on a single computer is considered infeasible. Distributed processing is therefore indispensable for analyzing big data efficiently, and many distributed processing models have been proposed (see, for example, Non-Patent Documents 2 and 3). Among them, the concept called MapReduce has come to be widely applied to big data analysis because it scales well, is easy to implement, and allows the scheme for distributed processing to be defined flexibly (see, for example, Non-Patent Documents 4 and 5). MapReduce is broadly divided into Map, Shuffle, and Reduce processing, and because the operations in each are independent of one another, analysis can be performed in parallel. Map processing divides big data into units of computation, Shuffle processing gathers the divided data onto arbitrary nodes (computing resources), and Reduce processing performs statistical or other processing on the aggregated data.

However, existing information analysis systems that use MapReduce have two major problems. The first is that performing the Shuffle processing with the hash values of the data divided by the Map processing destroys the continuity of the divided data. MapReduce shuffles by the hash value of each piece of data because continuous data could otherwise concentrate on particular nodes; hashing is intended to remove the resulting imbalance in the later Reduce work. As a consequence, it is difficult to gather continuous data together and perform analysis that refers to the order of neighboring items.
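The loss of continuity can be illustrated with a minimal Python sketch. It is not taken from any of the systems discussed: the function names, the SHA-1 hash, and the four-node setup are assumptions chosen only to contrast hash-based placement with order-preserving placement of consecutive keys.

```python
import hashlib

def hash_partition(key: int, num_nodes: int) -> int:
    """Place a key by hashing it (illustrative stand-in for a hash-based Shuffle)."""
    digest = hashlib.sha1(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

def range_partition(key: int, key_max: int, num_nodes: int) -> int:
    """Place a key by its position in the key range, so neighbors stay together."""
    return min(key * num_nodes // key_max, num_nodes - 1)

# Twelve consecutive time-series keys spread over four hypothetical nodes.
timestamps = list(range(12))
by_hash = [hash_partition(t, 4) for t in timestamps]
by_range = [range_partition(t, 12, 4) for t in timestamps]

print("hash placement :", by_hash)   # neighbors are generally scattered
print("range placement:", by_range)  # neighbors share a node
```

With range placement, consecutive keys land on the same or adjacent nodes, which is what analysis of neighboring items needs; the price is that a skewed key distribution can overload some nodes, which is the imbalance hashing was meant to avoid.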

The second problem is that recursive MapReduce is difficult. When the results of aggregating and analyzing data with MapReduce are themselves to be processed by MapReduce again across other nodes, an aggregation node that collects the analysis results must be selected. To maximize the computational efficiency (parallelism) of recursive MapReduce, all nodes must maintain a balanced-tree aggregation structure. A balanced-tree aggregation structure makes the load of recursive analysis uniform among the nodes, but the cost of maintaining the balanced tree becomes very high. Specifically, while a balanced tree is being built, the tree must be rebuilt many times to keep its left and right sides balanced.

The following describes two existing MapReduce-based information analysis systems: "Jubatus" (registered trademark), which can perform MapReduce recursively, and "ddd", which achieves a highly scalable information analysis system by using a structured overlay. It also explains why neither system can satisfy recursive analysis and continuous data management at the same time (see, for example, Non-Patent Documents 7 and 8).

"Jubatus" (registered trademark) is a MapReduce-based information analysis system provided as open source. FIG. 1 shows its architecture. For clients that perform the Map function, "Jubatus" provides "JubaKeeper", which offers a Map function and a Shuffle function. When a client sends a job such as morphological analysis to "JubaKeeper", "JubaKeeper" performs the Map processing and Shuffle processing and transfers the data to the "JubaKeeper" in charge of Reduce. Because the client can describe the analysis scheme used in "JubaClassifier", recursive analysis is possible by aggregating the Reduce results among "JubaClassifier" instances (MIX processing). MIX processing is performed after locking the resources of the target nodes through "ZooKeeper", which centrally manages consistency and maintenance for all nodes. Data sent to a node while it is locked is discarded. Because management of the aggregation nodes is left entirely to "ZooKeeper", "ZooKeeper" can become a single point of failure, and scalability is poor. Continuous data can be handled as analysis data, but the selection of the aggregation node for MIX processing on continuous data depends on the client's analysis scheme. Moreover, unless the state of each node and the overall distribution of keys are known, it is difficult to treat the aggregation structure as a balanced tree, so the parallelism of the analysis cannot necessarily be maximized.

In contrast to "Jubatus" (registered trademark), "ddd" improves the scalability of MapReduce by managing all nodes with a structured overlay. FIG. 2 shows the architecture of "ddd". "ddd" performs the Shuffle processing on hashed data and places the data on nodes of a structured overlay based on Consistent Hashing, so it cannot handle continuous data as such (see, for example, Non-Patent Document 9). Continuous data can still be handled in "ddd", but doing so raises the problem that the load becomes skewed depending on the data. To solve this problem, "ddd" uses virtual nodes to resolve the load imbalance that arises between nodes. In load balancing with virtual nodes, each node on the structured overlay generates many virtual nodes of its own and fills the ID space of the structured overlay with them, which equalizes the load on each node. With this approach the load can be made uniform on the structured overlay even when the input is continuous data. However, because the number of virtual nodes must be fixed in advance, the load is not necessarily averaged, depending on the distribution of the continuous data. Furthermore, recursive Reduce processing requires the user to describe the recursion, so for the same reason as with "Jubatus" (registered trademark) it is difficult to build a balanced tree structure.
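The virtual-node approach can be sketched in Python as follows. This is a generic illustration of consistent hashing with virtual nodes, not the implementation of "ddd"; the ring size, the `node#index` labeling of virtual nodes, and all function names are assumptions made for the example.

```python
import bisect
import hashlib

ID_SPACE = 2 ** 32  # assumed size of the ring-shaped ID space

def ring_id(label: str) -> int:
    """Map a label to a position on the ring (first 4 bytes of SHA-1)."""
    return int.from_bytes(hashlib.sha1(label.encode()).digest()[:4], "big")

def build_ring(nodes, vnodes_per_node):
    """Return a sorted list of (ring position, physical node) pairs."""
    ring = [(ring_id(f"{n}#{v}"), n) for n in nodes for v in range(vnodes_per_node)]
    ring.sort()
    return ring

def lookup(ring, key_id):
    """Return the node owning key_id: the first virtual node clockwise from it."""
    positions = [pos for pos, _ in ring]
    idx = bisect.bisect_left(positions, key_id) % len(ring)  # wrap past the end
    return ring[idx][1]

def load_per_node(ring, key_ids):
    """Count how many keys each physical node ends up holding."""
    counts = {}
    for k in key_ids:
        owner = lookup(ring, k)
        counts[owner] = counts.get(owner, 0) + 1
    return counts

ring = build_ring(["A", "B", "C"], vnodes_per_node=8)
# Unhashed, consecutive keys occupy a narrow band of the ID space.
counts = load_per_node(ring, range(1000, 2000))
print(counts)
```

Because the number of virtual nodes is fixed when the ring is built, a narrow band of consecutive keys can still fall between two virtual nodes and be assigned to a single physical node, which is the limitation noted above.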

Non-Patent Document 1: S4 Project, http://incubator.apache.org/s4/
Non-Patent Document 2: Leslie G. Valiant, "A Bridging Model for Parallel Computation," Communications of the ACM, Volume 33, Issue 8, Aug. 1990.
Non-Patent Document 3: http://www.cs.berkeley.edu/matei/spark/
Non-Patent Document 4: Hadoop, http://lucene.apache.org/hadoop/, 2007.
Non-Patent Document 5: J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI 2004: 6th Symposium on Operating Systems Design and Implementation, 2004.
Non-Patent Document 6: W. Pugh, "Skip Lists: A Probabilistic Alternative to Balanced Trees," Communications of the ACM, 1990.
Non-Patent Document 7: Jubatus, http://jubat.us/
Non-Patent Document 8: ddd, http://www.iij.ad.jp/development/tech-activities/detail/ddd.html
Non-Patent Document 9: D. Karger, et al., "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web," in Proc. ACM Symposium on Theory of Computing, 1997.

As described above, basic studies of scalability and recursive Reduce processing in MapReduce have been carried out, but they do not address methods for optimizing this processing. In other words, with existing technology it is considered difficult to build an information analysis system that is highly scalable and parallel and that also achieves efficient operation and management for continuous data.

The present invention has been made in view of the above points, and its object is to provide a data distribution management system, apparatus, method, and program that shorten analysis time and enable analysis of continuous data on a structured overlay.

To solve the above problems, the present invention (claim 1) is a data distribution management system in a structured overlay network that manages a plurality of metadata in a distributed manner,
the system having a plurality of nodes on the structured overlay network,
wherein each node has
structured overlay management means and MapReduce processing management means,
the structured overlay management means has
load balancing means for controlling the data held by the plurality of nodes so that the load is uniform among the nodes, and
structuring means for structuring the connections between the nodes into a balanced tree in order to analyze the data equalized by the load balancing means, and
the MapReduce processing management means has
data storage means for storing data transmitted to the overlay network,
analysis management means for managing an analysis scheme for analyzing the data stored in the data storage means, and
analysis result integration means for transmitting the analysis result obtained by the analysis scheme to other nodes and combining the analysis results.

Further, in the present invention (claim 2), the MapReduce processing management means
includes MapReduce processing means that maintains the connection relationships between the nodes as a balanced tree structure and, in accordance with that tree structure, performs Map processing that divides the plurality of metadata into data of computation units, Shuffle processing that transfers the divided data to the nodes in charge of it, and Reduce processing that aggregates and processes the analysis results of the node devices.

Further, in the present invention (claim 3), the load balancing means
includes means that, when the own node is the root node, uses a predetermined discriminant to identify lightly loaded nodes and heavily loaded nodes based on the load information of each node aggregated along the balanced tree structure, selects from the ID space managed by a heavily loaded node a hash ID (HashID) that makes the load uniform among the nodes, and adds the selected HashID to a lightly loaded node.

As described above, the present invention realizes an information analysis system that is highly scalable and parallel, that can equalize the amount of data held by each node device even for continuous data such as time-series data, and that achieves efficient operation and management.

FIG. 1: Architecture of "Jubatus" (registered trademark).
FIG. 2: Architecture of "ddd".
FIG. 3: Outline of the operation of MapReduce on a structured overlay using SkipList in one embodiment of the present invention.
FIG. 4: Node configuration diagram in one embodiment of the present invention.
FIG. 5: Aggregation structure of analysis data in one embodiment of the present invention.
FIG. 6: Sequence chart of the routing table construction processing in one embodiment of the present invention.
FIG. 7: Outline of the load balancing method in one embodiment of the present invention.
FIG. 8: Sequence chart of load balancing in one embodiment of the present invention.
FIG. 9: Search procedure of the Shuffle processing in one embodiment of the present invention.
FIG. 10: Distribution and execution procedure of an analysis scheme in one embodiment of the present invention.
FIG. 11: Sequence chart of the load equalization processing in one embodiment of the present invention.
FIG. 12: Sequence chart of the analysis process in one embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

The present invention resembles a Publisher/Subscriber system. A Publisher/Subscriber system is used in delivery services such as e-mail newsletters: a client (Subscriber) registers the newsletters it wants to receive with a server (Publisher) through a subscription registration interface, and when the server issues a newsletter, it delivers it to that newsletter's subscribers by push delivery. Mapped onto the present invention, data that various sensors and monitors transmit (Publish) continuously keeps being accumulated (Shuffle) on a structured overlay that uses SkipList. When a user wants to analyze the accumulated data, the user Subscribes (issues) an analysis scheme that the user has written to the SkipList and thereby obtains the results of parallel analysis, either once or periodically.

FIG. 3 outlines the operation of MapReduce on a structured overlay that uses SkipList. A large amount of data arrives at the nodes operating on the SkipList, and the data is shuffled (accumulated) on the SkipList. The accumulated data is then reduced recursively along the balanced tree formed by the SkipList.
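Recursive Reduce along a tree can be sketched in Python as below. This is a simplified, sequential illustration rather than the embodiment itself: the tree, the per-node data, and the names `tree_reduce`, `reduce_fn`, and `combine_fn` are hypothetical, and a real deployment would evaluate the children in parallel on separate nodes.

```python
def tree_reduce(node, children, local_data, reduce_fn, combine_fn):
    """Reduce a node's own data, then fold in the results of its subtree."""
    partial = reduce_fn(local_data[node])
    for child in children.get(node, []):
        child_result = tree_reduce(child, children, local_data, reduce_fn, combine_fn)
        partial = combine_fn(partial, child_result)
    return partial

# Hypothetical aggregation tree: node 0 is the root, node 3 hangs under node 1.
children = {0: [1, 2], 1: [3]}
local_data = {0: [1, 2], 1: [3], 2: [4, 5], 3: [6]}

total = tree_reduce(0, children, local_data, reduce_fn=sum, combine_fn=lambda a, b: a + b)
print(total)  # 21
```

When the tree is balanced, each node combines roughly the same number of child results, which is why the shape of the aggregation structure determines how evenly the recursive Reduce work is spread.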

FIG. 4 shows the device configuration needed to achieve MapReduce using SkipList.

The MapReduce processing management unit (the management module for MapReduce processing) 100 implements an analysis scheme buffer (Analysis Scheme Buffer) 110 that stores the analysis schemes issued by users, an analysis result synchronization unit (Analysis Scheme Coordinator) 120, and a data buffer (Data Buffer) 130 that accumulates continuous data. Analysis schemes issued from the MapReduce processing management unit 100 are propagated using an API (Scatter API) provided by the structured overlay management unit 200.

The structured overlay management unit (the management module for the structured overlay) 200 implements a SkipList manager (SkipListManager) 210, which builds and maintains the SkipList, and a load balancing unit (Load Balancing Executor) 220, which balances load on the SkipList. The communication unit 300 implements a data transmission unit (Sender) 310 and a reception unit (Receiver) 320 that use TCP. Because the communication unit 300 is a module that merely sends and receives data through the TCP sockets provided by the OS, its description is omitted.

The following describes the aggregation structure based on SkipList and the procedure for building it, and then the method for equalizing load for continuous data. The method for subscribing an analysis scheme is also described briefly.

First, the structured overlay management unit 200 is described.

The structured overlay management unit 200 has the SkipList manager (SkipList Manager) 210 and the load balancing unit 220.

The SkipList manager 210 includes a routing buffer 211.

SkipList is a kind of balanced tree that, unlike an AVL tree or a B-tree, performs no processing (tree reconstruction) to keep the depth of the tree constant (see, for example, Non-Patent Document 6). A balanced tree can therefore be built easily, which makes it easy to improve the parallelism of analysis using MapReduce. To apply this structure to the aggregation structure between nodes, each node is first given two IDs. One is a hash value, used to perform Consistent Hashing as in "ddd". The other is a level: the aggregation structure of the nodes is layered according to this level, so that the relationships between nodes form a tree. For example, when a balanced N-ary tree is built, the probability that a node exists at level i is (1/N)^i, and a node at level i also belongs to every level below i. Therefore, using a random number RAND (0 ≤ RAND < 1), the level of each node can be calculated as

$$\mathrm{LEVEL} = \left[ \log_{1/N} \mathrm{RAND} \right]$$

Here [ ] is the Gauss bracket, and [a] denotes the largest integer not exceeding a. Expressions such as [a, b] and [a, b) appear frequently below: [a, b] means at least a and at most b, and [a, b) means at least a and less than b. Likewise, (a, b] means greater than a and at most b, and (a, b) means greater than a and less than b.
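Given the stated probability (1/N)^i of a node reaching level i, one consistent way to draw a level is to take the integer part of the base-1/N logarithm of RAND. The Python sketch below illustrates this under that assumption; the function name `assign_level`, the `max_level` cap (corresponding to the upper bound q that the description places on levels), and the treatment of RAND = 0 are choices made for the example.

```python
import math
import random

def assign_level(n_ary: int, max_level: int, rand: float) -> int:
    """Return floor(log_{1/N} rand), capped at max_level.

    P(level >= i) = P(rand <= (1/N)**i) = (1/N)**i for uniform rand in [0, 1).
    """
    if rand <= 0.0:
        return max_level  # log of 0 is unbounded, so use the cap (assumption)
    level = math.floor(math.log(rand) / math.log(1.0 / n_ary))
    return min(level, max_level)

# Empirical check for a binary tree (N = 2): about half the nodes reach level 1.
rng = random.Random(0)
levels = [assign_level(2, 10, rng.random()) for _ in range(10000)]
fraction_at_least_1 = sum(lv >= 1 for lv in levels) / len(levels)
print(round(fraction_at_least_1, 2))
```

Because each node draws its level independently, no global rebalancing step is needed; the tree is balanced only in expectation, which is the trade-off a skip list makes against AVL trees and B-trees.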

A node p at level i, whose hash ID is p (HashID: p), records in the routing table in the routing buffer 211, for each level at or below i, the node information (HashID and IP address) of its clockwise and counterclockwise neighbors as its Successor and Predecessor. Below, s_{p,k} denotes the HashID of the Successor of node p at level k (0 ≤ k ≤ i), and p_{p,k} denotes the HashID of its Predecessor at level k. The ID space handled by node p is (p_{p,0}, p] when p_{p,0} < p. When p < p_{p,0}, the ID space handled by node p is (p_{p,0}, L) together with [0, p], where L is the size of the entire ID space; HashID L is equivalent to HashID 0, so the ID space has a ring structure. Below, both cases are written collectively as (p_{p,0}, p].

The ID space handled by node p is the range of HashIDs of the data placed on that node during the Shuffle processing. For example, suppose the ID space handled by node p runs from 10 to 100. Data whose HashID is 45 or 76 is then held by node p.
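The ownership test on a ring-shaped ID space, including the wrap-around case, can be written as a short Python function. The function name and the ring size of 1024 used in the usage lines are assumptions made for illustration; the interval (predecessor, node] follows the definition in the description.

```python
def in_id_space(hash_id: int, pred: int, node: int, ring_size: int) -> bool:
    """True if hash_id lies in (pred, node] on a ring of size ring_size."""
    hash_id %= ring_size  # HashID L is equivalent to HashID 0
    if pred < node:
        return pred < hash_id <= node
    # Wrap-around case: (pred, L) together with [0, node].
    return hash_id > pred or hash_id <= node

# Node with HashID 100 whose level-0 Predecessor has HashID 10.
print(in_id_space(45, 10, 100, 1024))   # True
print(in_id_space(76, 10, 100, 1024))   # True
print(in_id_space(10, 10, 100, 1024))   # False: the lower bound is exclusive

# Wrap-around: node 20 whose Predecessor is 900.
print(in_id_space(950, 900, 20, 1024))  # True
print(in_id_space(500, 900, 20, 1024))  # False
```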

Furthermore, at each level k (1 ≤ k ≤ i), node p holds as child nodes the node information of the level k-1 nodes whose HashIDs fall within [p, s_{p,k}), and it holds as its parent the node that, at level i, holds p (node p's own HashID) as a child; this node information is kept in the routing buffer 211. The ID space handled by node p, (p_{p,0}, p], and the ID range of its child nodes, [p, s_{p,k}), are defined in opposite directions so that the ID space handled by the child nodes and the ID space handled by node p do not overlap and become redundant. When the number of nodes is N^q (0 < N, 0 ≤ q), a node for which LEVEL > q is assigned level q. A node at level q is the root node and is responsible for supervising the distributed computation on the SkipList. When several root nodes belong to level q, the node that determines that the HashID of its Successor at level q is smaller than its own HashID joins level q+1 and takes the level-q nodes as its children. These child nodes, parent nodes, and routing tables (the information on Successor and Predecessor nodes) are stored in the routing buffer 211.

FIG. 5 shows an example of the aggregation structure of analysis data described above.

FIG. 5 shows a SkipList structure consisting of 10 nodes; the number on each node is its HashID. Node 2 records nodes 1 and 3 at level 0, and likewise nodes 0 and 4 at level 1, in its routing table as its Successor and Predecessor nodes. Node 2 also records level-0 node 3 as a child node and level-2 node 0 as its parent node. Like node 2, node 0 holds routing-table, parent-node, and child-node information. Because no node is adjacent to node 0 at level 2, node 0 is the root node.

Next, the concrete procedure for building the SkipList is described. The processing by which a node joins the SkipList structure is defined as "Join processing". Join processing is the procedure that determines the ID space handled by node p, determines its parent and child nodes, and builds the routing table in the routing buffer 211.

FIG. 6 is a sequence chart of the routing table construction processing in one embodiment of the present invention.

First, the SkipList manager 210(P) of node p sends a message (Join message) carrying its own HashID p to an arbitrary node already on the SkipList, in order to search for the node in charge of p (step 101). On receiving the message, the SkipList manager 210(a) of node b at level a checks, for the node information of every Successor recorded in its routing table, whether p falls within the range [b, s_{b,i}) (0 ≤ i ≤ a), starting from the top of the routing table (step 102). If p falls within [b, s_{b,i}), it forwards the message to the child node in that range whose HashID is closest to p without exceeding p (step 103). If not, it forwards the message to its parent node and delegates the Join processing to the parent (step 104).

Forwarding proceeds in the clockwise direction. By repeating this forwarding recursively between nodes, the message reaches the node in charge of p. Let node m be the node that manages p. Node m sends its own parent, child-node, and routing-table information to node p and, at the same time, reflects p by forwarding it to the nodes in its own routing table and to its parent and child nodes. Through this procedure each node completes the Join processing for the SkipList structure, and the SkipList structure is built.
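A single forwarding decision of steps 102 to 104 can be sketched in Python as follows. This is a simplified reading of the procedure, not the patented implementation: the `RoutingTable` layout, the `"local"` outcome (used when p falls inside one of the node's ranges but no child lies at or before p), and the example values are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

def in_half_open(x: int, start: int, end: int, ring_size: int) -> bool:
    """True if x lies in [start, end) walking clockwise on the ring."""
    if start == end:  # a lone node's range covers the whole ring
        return True
    return (x - start) % ring_size < (end - start) % ring_size

@dataclass
class RoutingTable:
    node_id: int                    # HashID b of this node
    level: int                      # level a of this node
    successors: Dict[int, int]      # level -> Successor HashID s_{b,level}
    children: Dict[int, List[int]]  # level k -> HashIDs of level k-1 child nodes
    parent: Optional[int]           # parent HashID, None at the root

def next_hop(t: RoutingTable, p: int, ring_size: int) -> Tuple[str, int]:
    """Decide where a Join message for HashID p goes next."""
    hit = False
    offset_p = (p - t.node_id) % ring_size
    for k in range(t.level, -1, -1):  # step 102: check from the top level down
        if not in_half_open(p, t.node_id, t.successors[k], ring_size):
            continue
        hit = True
        # Step 103: child closest to p without exceeding p (clockwise offsets).
        below = [c for c in t.children.get(k, [])
                 if (c - t.node_id) % ring_size <= offset_p]
        if below:
            return ("child", max(below, key=lambda c: (c - t.node_id) % ring_size))
    if hit or t.parent is None:
        return ("local", t.node_id)   # assumption: the search ends at this node
    return ("parent", t.parent)       # step 104: delegate to the parent

# Hypothetical node 20 at level 1 on a ring of size 100.
table = RoutingTable(node_id=20, level=1, successors={0: 30, 1: 40},
                     children={1: [30]}, parent=0)
print(next_hop(table, 35, 100))  # ('child', 30)
print(next_hop(table, 25, 100))  # ('local', 20)
print(next_hop(table, 55, 100))  # ('parent', 0)
```

Repeating this decision at each node that receives the message moves the Join message up the tree until some range contains p and then down toward the node responsible for p.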

次に、負荷分散部２２０について説明する。 Next, the load distribution unit 220 will be described.

構造化オーバレイ上で連続データの負荷分散を行うために各ノードがオーバレイ上をランダムウォークするクエリ（Sampling Query）を投げ、オーバレイ上のノードをランダムに選択しつつ、当該ノードの負荷情報を集めて、全体の負荷を推定する方法がある（例えば、A. R. Bharamde, M. Agrawal, and S. Sechan, "Mercury: Scalable Routing for Range Queries," in ACM SIGCOMM, 2004.参照）。この手法における問題点として、規模に応じて、Sampling Queryの送信回数を適切に指定する必要があり、集めた負荷情報を元にした全体の推定精度に影響を与える可能性がある。 In order to distribute the load of continuous data on the structured overlay, each node throws a query that randomly walks on the overlay (Sampling Query), and randomly selects a node on the overlay and collects load information on that node. There is a method for estimating the total load (see, for example, AR Bharamde, M. Agrawal, and S. Sechan, “Mercury: Scalable Routing for Range Queries,” in ACM SIGCOMM, 2004.). As a problem in this method, it is necessary to appropriately specify the number of times the Sampling Query is transmitted according to the scale, which may affect the overall estimation accuracy based on the collected load information.

In contrast to the existing technique, the load estimation technique of the present invention uses the aggregation structure of the SkipList and can therefore estimate the overall load accurately regardless of scale. FIG. 7 shows an outline of the load balancing method. Specifically, the load distribution unit 220 of the structured overlay management unit 200 of each node transmits its own load information to its parent node, and the load distribution unit 220 of the parent node creates a load ranking. To create the ranking, the parent node sorts the node information in descending order of load based on the load information sent from its child nodes, extracts the node information of the top R (1 ≤ R) entries and of the bottom R (1 ≤ R) entries, and forwards the result to its own parent; this processing is repeated recursively up to the root node. Referring to the load information, the load distribution unit 220 of the root node selects, from the routing table in the routing buffer 211, a HashID within the ID space managed by a high-load node such that the load becomes uniform, and instructs a low-load node to add that HashID. It further instructs the low-load node to send a Join message to the high-load node using the added HashID, thereby equalizing the load. The discriminant that distinguishes high-load nodes from low-load nodes is given below.

Let L_i be the i-th (1 ≤ i ≤ R) load information in the ranking received by the root node, AVG the average of the overall load, Q_i the i-th load value of the ranking, and α a constant (0 < α < 1). Q_i is then expressed as

[Equation (2)]

In the following description, the ranking is regarded as a single ranking with ranks 1 to 2R, obtained by merging the aggregated upper and lower rankings.

FIG. 8 shows a sequence chart of the load balancing operation. The load distribution unit 220 of the structured overlay management unit 200 of the root node first uses Equation (2) to obtain the load values Q_1 and Q_2R of the rank-1 load information L_1 and the rank-2R load information L_2R. If Q_1 = 1 and Q_2R = 0, the root node instructs the node of L_2R (hereinafter "node 2R") to issue an ID space division request to the node of L_1 (hereinafter "node 1") (step 201). The load distribution unit 220 of node 2R then obtains the IP address of node 1 from the root node and notifies node 1 of its own load information L_2R (step 202). Having learned L_2R, node 1 divides its own ID space in its load distribution unit 220 so that its own load L_1' becomes (L_1 + L_2R)/2 (step 203), and transmits the split-off ID space and the data group associated with it to node 2R, which stores them in the data buffer 130 of its MapReduce process management unit 100 (step 204).

After notifying node 1 and node 2R of the ID space division request, the root node performs the same processing for node 2 and node 2R-1. This processing is repeated as long as the load values of the high-load node and the low-load node are 1 and 0, respectively. When a pair no longer yields the load values 1 and 0, the root node ends the processing, and load balancing of the whole overlay is complete. This load balancing is performed by the root node at fixed time intervals. The root node issues the command that makes each node send its load information to its parent node; the command is attached to the Stabilization messages that each parent node issues to its child nodes and is thus passed from upper-layer nodes to lower-layer nodes. This keeps down the number of messages required for load balancing.
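The pairing performed by the root node (rank 1 with rank 2R, rank 2 with rank 2R-1, and so on) can be sketched as below. The discriminant is passed in as a function because Equation (2) is not available in this text; `threshold_q` is a purely hypothetical stand-in used only to exercise the loop, and the names `plan_splits` and `target_load` are likewise assumptions.

```python
from typing import Callable, Dict, List, Tuple


def plan_splits(ranking: List[Tuple[str, float]],
                q: Callable[[float], int]) -> List[Dict[str, object]]:
    """Pair the heaviest node with the lightest, then move inward.

    ranking: (node, load) pairs sorted by descending load (ranks 1..2R).
    q: discriminant returning 1 for a high load and 0 for a low load.
    """
    plans: List[Dict[str, object]] = []
    hi, lo = 0, len(ranking) - 1
    while hi < lo:
        hi_node, hi_load = ranking[hi]
        lo_node, lo_load = ranking[lo]
        if not (q(hi_load) == 1 and q(lo_load) == 0):
            break                                   # load values are no longer 1 and 0
        plans.append({
            "from": hi_node,                        # node that splits its ID space
            "to": lo_node,                          # node that joins the split-off part
            "target_load": (hi_load + lo_load) / 2, # load after the split (step 203)
        })
        hi += 1
        lo -= 1
    return plans


def threshold_q(avg: float, alpha: float) -> Callable[[float], int]:
    """Hypothetical discriminant, NOT Equation (2): band of width alpha around avg."""
    def q(load: float) -> int:
        if load > (1 + alpha) * avg:
            return 1
        if load < (1 - alpha) * avg:
            return 0
        return -1
    return q
```

With loads 100, 60, 50, and 10 and the stand-in discriminant `threshold_q(55, 0.5)`, only the pair (100, 10) is selected, with a target load of 55 for both nodes.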

Next, the Shuffle processing will be described.

Assume that data n (HashID: n) issued by a sensor or monitor has arrived at node p. To perform the Shuffle processing on data n, node p must transfer the data to the node in charge of it. The transfer uses the same forwarding method as the construction procedure shown in FIG. 6.

A specific search procedure is described with reference to FIG. 9. Assume that the ID space is managed by ten nodes (0 to 9) and that each node manages the IDs between itself and its counterclockwise neighbor (Predecessor). Assume further that the ID space manages traffic data whose IDs are times, so that the data for time 3:45 is managed by node 4 and the data for time 5:54 is managed by node 6. Consider the procedure by which data for time 8:47 arrives at node 4 and is transferred (Shuffle) to the node that manages it.

First, the SkipList manager 210(4) of node 4 refers to its own routing table and checks, from the top of the table, whether 8:47 lies between 4 and any of its route entries. In this case it determines that 8:47 does not fall within the range of any Successor in its routing table. The SkipList manager 210(4) of node 4 therefore forwards a LookUp message, which searches for the node managing the 8:47 data, to its parent node, node 0. On receiving the LookUp message, node 0 reads its destination, 8:47, and searches its routing table from the top for a Successor range containing 8:47, but finds no node that manages 8:47. It then checks whether any of its parent and child nodes manages 8:47, determines that node 8 covers that range, and forwards the LookUp message to node 8. On receiving the LookUp message, node 8 examines its own routing table from the top, establishes that node 9 manages the 8:47 data, and reports the IP address of node 9 to node 4, the sender of the message. Having learned the IP address of node 9, node 4 transfers the 8:47 data to node 9, which completes the Shuffle processing.
When the Shuffle processing is complete, the SkipList manager 210(9) of node 9 automatically stores the data in the data buffer 130 of the MapReduce process management unit 100 in the upper layer.
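The ownership rule used in this example (each node manages the IDs between its Predecessor and itself) can be checked with a short sketch. It assumes, for illustration only, that node k sits at ID k on a ring of hours 0 to 9 and that times are mapped to fractional hours; `owner` and `hhmm` are hypothetical helper names.

```python
from bisect import bisect_left
from typing import Iterable


def hhmm(text: str) -> float:
    """Convert a time such as "8:47" into fractional hours."""
    hours, minutes = text.split(":")
    return int(hours) + int(minutes) / 60


def owner(node_ids: Iterable[float], data_id: float) -> float:
    """Return the first node ID that is >= data_id, wrapping around the ring."""
    ids = sorted(node_ids)
    index = bisect_left(ids, data_id)
    return ids[index % len(ids)]


nodes = range(10)
print(owner(nodes, hhmm("3:45")))  # 4
print(owner(nodes, hhmm("5:54")))  # 6
print(owner(nodes, hhmm("8:47")))  # 9
```

This only determines which node is responsible; the LookUp forwarding through the parent and child nodes described above is what locates that node without global knowledge.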

The number of communications needed for the Shuffle processing of data issued by sensors or monitors is 2 log(n) per data item, where n is the number of nodes. Because the communication cost of the Shuffle processing grows only slowly as nodes are added, the Shuffle is highly scalable.
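To make the growth rate concrete, the following sketch evaluates 2 log(n) for a few overlay sizes. The text does not state the base of the logarithm; base 2 is assumed here, and `shuffle_messages` is a hypothetical helper name.

```python
import math


def shuffle_messages(n: int) -> float:
    """Communications per data item for n nodes, assuming a base-2 logarithm."""
    return 2 * math.log2(n)


for n in (1024, 2048, 1_048_576):
    print(n, shuffle_messages(n))
```

Under this assumption, doubling the number of nodes from 1024 to 2048 adds only two communications per data item.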

Next, the method by which the SkipList manager 210 maintains the SkipList will be described.

To maintain the SkipList, the SkipList manager 210 of each node periodically sends Stabilization messages to its Successor and Predecessor at each level and to its parent node to check whether those nodes are alive or have left. If no response is received, the node is judged to have left; the manager searches again for a node that can serve as the Successor, Predecessor, or Parent, and updates the SkipList structure with the information on the replacement node.

Besides this procedure, the SkipList structure can be maintained easily by increasing the number of Successors and Predecessors that each node holds. Specifically, each node keeps information on the Successor of its own Successor and on the Predecessor of its own Predecessor. Then, even if a Successor leaves, the node can maintain the SkipList structure immediately, simply by connecting to that Successor's Successor.
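The fallback to the Successor's Successor can be sketched as a scan over an ordered list of known successors. The list layout, the `is_alive` probe (standing in for a Stabilization message and its reply), and the function name are assumptions for illustration.

```python
from typing import Callable, List, Optional


def first_live_successor(successor_list: List[str],
                         is_alive: Callable[[str], bool]) -> Optional[str]:
    """Return the nearest successor that still responds, or None if none does.

    successor_list is ordered nearest first: [Successor, Successor's Successor, ...].
    """
    for node in successor_list:
        if is_alive(node):      # stands in for a Stabilization probe and its reply
            return node
    return None                 # all known successors left: a new search is needed


departed = {"node5"}
print(first_live_successor(["node5", "node6"], lambda n: n not in departed))  # node6
```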

Next, the processing of the MapReduce process management unit 100 will be described.

This section describes how user-defined analysis content, called an analysis scheme, is applied to the SkipList. FIG. 10 shows an outline of the distribution and execution of an analysis scheme. An analysis scheme can be submitted to any node, and the user submits it together with an analysis scheme identifier (AID). To apply the analysis scheme to all nodes, the MapReduce process management unit 100 uses Scatter(AnalysisScheme), an API provided by the structured overlay management unit 200. When distribution of the analysis scheme is requested from the structured overlay management unit 200 through the Scatter API, the SkipList manager 210 of the node that received the scheme first searches for the root node. Once the root node is found, a ScatterAnalizer message is forwarded to it, and the root node forwards the message to its own child nodes. The ScatterAnalizer message, together with the user-defined analysis scheme, is forwarded recursively to every node along the tree structure of the SkipList. Each node that receives the ScatterAnalizer message registers the scheme and its AID in its own synchronization unit 120 (Scheme Coordinator). The synchronization unit 120 stores the registered analysis schemes and, when an analysis scheme is executed, synchronizes the analysis results arriving from the child nodes.
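The distribution path described above (submit at any node, locate the root, forward down the tree, register under the AID) can be sketched with an in-memory tree. The `Node` class, its `schemes` dictionary (standing in for the synchronization unit 120), and the function `scatter` are hypothetical names for this illustration and do not reproduce the actual Scatter API or message format.

```python
from typing import Callable, Dict, List, Optional


class Node:
    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: List["Node"] = []
        self.schemes: Dict[str, Callable] = {}   # AID -> analysis scheme
        if parent is not None:
            parent.children.append(self)

    def find_root(self) -> "Node":
        """Walk up the parent links until the root node is reached."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node

    def receive_scatter(self, aid: str, scheme: Callable) -> None:
        """Register the scheme locally, then forward it to every child."""
        self.schemes[aid] = scheme
        for child in self.children:
            child.receive_scatter(aid, scheme)


def scatter(entry: Node, aid: str, scheme: Callable) -> None:
    """Submit a scheme at any node; it is distributed starting from the root."""
    entry.find_root().receive_scatter(aid, scheme)
```

Submitting at a leaf, for example `scatter(leaf, "top100", scheme)`, leaves the scheme registered under `"top100"` on every node of the tree.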

An example of analysis is shown below.

Assume that one day of traffic data is stored across all the nodes. The traffic data is time-series data; there is little daytime traffic data but an enormous amount of nighttime traffic data. The data is transferred by the Shuffle processing to the node in charge of each data item, and each node in charge manages its share of the traffic data. Each traffic data record consists of a time and the IP address of a communication terminal. Assume that an analysis scheme that identifies, from this traffic data, the IP addresses of the 100 communication terminals with the most communications has been applied to the nodes on the SkipList.

・Load equalization
This section describes how the load is equalized between the ID spaces of high-load and low-load nodes.

FIG. 11 is a sequence chart of the load equalization processing in one embodiment of the present invention.

At fixed time intervals, the SkipList manager 210 of the structured overlay management unit 200 of the root node requests its child nodes to transfer their load information (step 301). Each child node then forwards the request to its own child nodes (step 302). By performing this processing recursively, the load information transfer request reaches the level-0 nodes (step 303). Each level-0 node reports the total amount of its traffic data to its parent node as its load information (step 304). A parent node that has received load information from each of its child nodes creates a ranking of the load information (step 305) and forwards it to its own parent node (step 306). By performing this processing recursively, the ranking is delivered to the root node (step 307). On receiving the ranking, the root node, in its load distribution unit 220, instructs a low-load node to perform Join processing on the ID space handled by a high-load node (step 308). For example, if the high-load node has a load of 100 and the low-load node has a load of 10, the low-load node receives part of the ID space from the high-load node so that its load becomes 55, the midpoint between 10 and 100, and performs Join processing on the ID space it has received. Repeating this removes the load imbalance of the continuous data.
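Steps 305 and 306 can be sketched as a merge that keeps only the R heaviest and R lightest entries before reporting upward. The tuple layout `(node, load)` and the function names are assumptions for illustration.

```python
from typing import List, Tuple

Entry = Tuple[str, float]  # (node name, load)


def summarize(entries: List[Entry], r: int) -> Tuple[List[Entry], List[Entry]]:
    """Return the top-r and bottom-r entries by load."""
    unique = dict(entries)  # drop repeated reports for the same node
    ordered = sorted(unique.items(), key=lambda item: item[1], reverse=True)
    return ordered[:r], ordered[-r:]


def report_to_parent(child_reports: List[List[Entry]], r: int) -> List[Entry]:
    """Merge the children's reports and keep at most 2r entries (steps 305-306)."""
    merged = [entry for report in child_reports for entry in report]
    top, bottom = summarize(merged, r)
    return top + [entry for entry in bottom if entry not in top]


reports = [[("a", 100), ("b", 40)], [("c", 10), ("d", 70)]]
ranking = report_to_parent(reports, r=1)
print(ranking)                                   # [('a', 100), ('c', 10)]
print((ranking[0][1] + ranking[-1][1]) / 2)      # 55.0
```

Because each parent forwards at most 2R entries, the size of a report stays bounded however many nodes lie below it.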

・Analysis process
FIG. 12 shows the operation of the analysis process in the MapReduce process management unit 100.

First, each node creates, from the traffic data it manages in the data buffer 130 of its own MapReduce process management unit 100, a ranking of the number of communications per communication terminal IP address (step 401), and transfers the IP addresses and communication counts of the top 100 terminals to its parent node (step 402). Each parent node waits until rankings have arrived from all the child nodes it manages (step 403). Once all the ranking results are in (steps 404 and 405), it again builds a ranking of the top 100 from the received data (step 406) and transfers that ranking to its own parent node (step 407). By creating the ranking recursively in this way, the ranking reaches the root node. After the root node has rebuilt the ranking (step 408), it transfers the analysis result to the node that requested the analysis, which is registered under the AID (step 409). At this point the computational load on each node has been evened out, because the load distribution unit 220 has smoothed the load imbalance of the traffic data, which is continuous data. In addition, each analysis is performed recursively on a structured overlay that has a balanced-tree structure, so the parallelism of the analysis can be maximized.
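The per-node ranking (step 401) and the merge at each parent (step 406) can be sketched with `collections.Counter`. The record layout `(time, ip)` follows the description above; the function names and the small `k` used in the example are assumptions. As in the described procedure, each level forwards only its top k entries.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Ranking = List[Tuple[str, int]]  # [(ip address, communication count), ...]


def local_ranking(records: Iterable[Tuple[str, str]], k: int = 100) -> Ranking:
    """Count communications per IP in (time, ip) records and keep the top k."""
    return Counter(ip for _, ip in records).most_common(k)


def combine_top_k(rankings: Iterable[Ranking], k: int = 100) -> Ranking:
    """Sum the counts from the received rankings and keep the top k again."""
    total: Counter = Counter()
    for ranking in rankings:
        total.update(dict(ranking))   # adds counts for IPs seen by several nodes
    return total.most_common(k)


leaf1 = local_ranking([("0:01", "10.0.0.1"), ("0:02", "10.0.0.1"),
                       ("0:03", "10.0.0.2")], k=2)
leaf2 = local_ranking([("1:00", "10.0.0.2"), ("1:01", "10.0.0.2"),
                       ("1:02", "10.0.0.3")], k=2)
print(combine_top_k([leaf1, leaf2], k=2))  # [('10.0.0.2', 3), ('10.0.0.1', 2)]
```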

The components of the node shown in FIG. 4 above, namely the MapReduce process management unit 100, the structured overlay management unit 200, and the communication unit 300, can each be built as software (a program), that is, as a MapReduce process management module, a structured overlay management module, and a communication module. These modules can be installed and run on a computer used as a node, or distributed over a network.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

100 MapReduce process management unit
110 Analysis Scheme buffer
120 Synchronization unit (Analysis Scheme Coordinator)
130 Data buffer
200 Structured overlay management unit
210 SkipList manager
211 Routing buffer
220 Load distribution unit (Load Balancing Executor)
300 Communication unit
310 Transmission unit
320 Reception unit

Claims

A distributed data management system in a structured overlay network that manages a plurality of metadata in a distributed manner,
Having multiple nodes on a structured overlay network;
The node is
Having structured overlay management means and MapReduce processing management means,
The structured overlay management means includes:
Load balancing means for controlling the data held by the plurality of nodes so that the load is uniform among the nodes;
Structuring means for structuring the connections between the nodes to be balanced trees in order to analyze the data that has been made uniform by the load balancing means,
The MapReduce process management means
Data storage means for storing data transmitted to the overlay network;
Analysis management means for managing an analysis scheme for analyzing data stored in the data storage means;
A data distribution management system comprising: an analysis result integration unit that transmits an analysis result according to the analysis scheme to another node and combines the analysis results.

The data distribution management system according to claim 1, wherein the MapReduce process management means further comprises MapReduce processing means that maintains the balanced-tree structure of the connections between nodes and performs Map processing for dividing the plurality of metadata into computation-unit data according to the tree structure of the balanced tree, Shuffle processing for transferring the divided data to the nodes in charge, and Reduce processing for collecting and processing the analysis results of the respective node devices.

The data distribution management system according to claim 1, wherein the load balancing means includes means for, when the own node is a root node, identifying low-load nodes and high-load nodes by a predetermined discriminant based on the load information of each node aggregated along the tree structure of the balanced tree, selecting, from the ID space managed by a high-load node, a hash ID (HashID) that makes the load uniform among the nodes, and adding the selected HashID to a low-load node.

A plurality of data distribution management devices (nodes) in a data distribution management system in a structured overlay network that manages a plurality of metadata in a distributed manner,
Having structured overlay management means and MapReduce processing management means,
The structured overlay management means includes:
Load balancing means for controlling the data held by the plurality of nodes so that the load is uniform among the nodes;
Structuring means for structuring the connections between the nodes to be balanced trees in order to analyze the data that has been made uniform by the load balancing means,
The MapReduce process management means
Data storage means for storing data transmitted to the overlay network;
Analysis management means for managing an analysis scheme for analyzing data stored in the data storage means;
A data distribution management device comprising: analysis result integration means for transmitting an analysis result according to the analysis scheme to another node and combining the analysis results.

The data distribution management device according to claim 4, wherein the MapReduce process management means further comprises MapReduce processing means that maintains the balanced-tree structure of the connections between nodes and performs Map processing for dividing the plurality of metadata into computation-unit data according to the tree structure of the balanced tree, Shuffle processing for transferring the divided data to the nodes in charge, and Reduce processing for collecting and processing the analysis results of the respective node devices.

The data distribution management device according to claim 4, wherein the load balancing means includes means for, when the own node is a root node, identifying low-load nodes and high-load nodes by a predetermined discriminant based on the load information of each node aggregated along the tree structure of the balanced tree, selecting, from the ID space managed by a high-load node, a HashID that makes the load uniform among the nodes, and adding the selected HashID to a low-load node.

A data distribution management method in a data distribution management system in a structured overlay network that distributes and manages a plurality of metadata,
In a system having multiple nodes on a structured overlay network,
A structured overlay management step in which the structured overlay management means controls the data held by the plurality of nodes so that the load is uniform among the nodes and, in order to analyze the uniformed data, structures the connections between the nodes into a balanced tree; and
A MapReduce process management step in which the MapReduce processing management means stores, in data storage means, data transmitted to the overlay network, manages an analysis scheme for analyzing the data stored in the data storage means, transmits the analysis result obtained by the analysis scheme to other nodes, and combines the analysis results,
Wherein,
In the structured overlay management step, when equalizing the load,
When the own node is a root node, a node having a low load or a node having a high load is determined by a predetermined discriminant based on the load information of each node aggregated according to the tree structure of the balanced tree, and the load is high. A distributed data management method characterized by selecting a HashID that makes the load uniform among nodes from an ID space managed by the node, and adding the selected HashID to a node with a low load.

A data distribution management program for causing a computer to function as each means of the data distribution management device according to claim 4.