JP5637071B2

JP5637071B2 - Processing program, processing method, and processing apparatus

Info

Publication number: JP5637071B2
Application number: JP2011118969A
Authority: JP
Inventors: 克久中里
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2011-05-27
Filing date: 2011-05-27
Publication date: 2014-12-10
Anticipated expiration: 2031-05-27
Also published as: JP2012247979A

Description

The present application relates to a processing program, a processing method, and a processing apparatus.

Analysis of large volumes of data requires a very long processing time. To address this, a recent approach shortens processing time by performing distributed, parallel processing on multiple machines. One such approach uses the MapReduce algorithm (see, for example, Non-Patent Document 1). Apache Hadoop is an open-source implementation of the MapReduce algorithm.

MapReduce consists mainly of "Map processing", which divides the original data into many key-value sets, and "Reduce processing", which aggregates those key-value sets according to some rule. Because multiple instances of Map processing and of Reduce processing can run in parallel, assigning them to multiple processing machines (servers and the like) makes it possible to exploit the combined processing performance of those machines.

However, to increase the benefit of distributed, parallel processing with MapReduce, each Map process and each Reduce process must be made highly independent so that it can run without depending on other parts.

One type of analysis processing groups related data out of a large body of data. An example is grouping the business logs recorded during a certain period, shown in FIG. 22(a), into units of one continuous business flow, as shown in FIG. 22(b). In grouping, handling the data of one group does not require considering the data of another group, so the work can be done efficiently by distributing the processing of each group across multiple servers.

When one continuous business flow is identified by a single key type (the flow ID in FIG. 22(a)), as in FIG. 22(a), grouping the data is easily achieved with MapReduce. This is because a processing machine running MapReduce has, as a standard feature, the ability to gather the data having a given key value in one place.

Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.

However, as shown in FIG. 23(a), there may be multiple keys (three types in FIG. 23(a)) that identify one continuous business flow. FIG. 23(b) shows an example in which the data of FIG. 23(a) has been aggregated. In such a case, grouping cannot be done in a straightforward way. In processing that aggregates related data using multiple key types (hereinafter called "multi-key aggregation processing"), which combination of keys identifies one related set of data cannot be fully determined without looking at the data as a whole. In the case of FIG. 23(a), for example, aggregating by slip number = 001 cannot gather the data of the slip sub-detail table. Conversely, aggregating by slip sub-detail number = 001-001-001 cannot gather the data of the slip table.
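The limitation described above can be illustrated with a minimal Python sketch. The three records, the field names (`slip_no`, `detail_no`, `sub_detail_no`), and which keys each table carries are hypothetical stand-ins for FIG. 23(a), not data taken from the specification; `group_by` is an ordinary single-key grouping.

```python
from collections import defaultdict

# Hypothetical records: each one carries only the keys its own table knows.
records = [
    {"table": "slip", "keys": {"slip_no": "001"}},
    {"table": "slip_detail",
     "keys": {"slip_no": "001", "detail_no": "001-001"}},
    {"table": "slip_sub_detail",
     "keys": {"detail_no": "001-001", "sub_detail_no": "001-001-001"}},
]


def group_by(key_type, recs):
    """Group records by one key type; records lacking that key are skipped."""
    groups = defaultdict(list)
    for rec in recs:
        if key_type in rec["keys"]:
            groups[rec["keys"][key_type]].append(rec["table"])
    return dict(groups)


by_slip = group_by("slip_no", records)        # misses the sub-detail record
by_sub = group_by("sub_detail_no", records)   # misses the slip record
```

Under these assumptions, neither single-key grouping collects all three records, although they belong to the same business flow.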

In this case, the aggregation must be devised so that it proceeds while the information on key-value combinations is brought up to date as processing advances. If that updating is insufficient, some data may be left out of the aggregation.

One conceivable countermeasure is to create a table that manages the relationships between key types in an RDB (relational database) or the like. However, if there is a location that every processing machine performing the distributed, parallel processing references and updates in common, the performance and scalability of the distributed processing may degrade.

The present application has been made in view of the above problem. Its object is to provide a processing program, a processing method, and a processing apparatus that, when aggregating a plurality of data classified by a plurality of key types, can prevent data from being left out of the aggregation and can improve performance and scalability.

The processing program described in this specification causes a computer that aggregates a plurality of data classified by a plurality of key types to execute processing of: generating, for each piece of data, map data that associates a key list, which is a list collecting the keys related to that data, with an aggregate key, which is one of those keys, and storing the map data in a storage unit; referring to the storage unit to acquire the key lists of a group of map data associated with the same aggregate key, and comparing the largest number of keys contained in any one of the acquired key lists with the total number of keys obtained by merging all of the acquired key lists; and, when the comparison shows that the total number of keys is larger, generating new map data, each having one of the keys contained in the merged key list as its aggregate key and the merged key list as its key list, in a number equal to the number of keys contained in the merged key list, and storing the new map data in the storage unit.
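The comparison and regeneration steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: keys are represented as strings such as `"Y=0101"`, map data is a `(aggregate_key, key_list)` tuple, the function name `reduce_group` is hypothetical, and the behavior when the total is not larger (keeping a single map datum) is an assumption because the passage above does not specify it.

```python
def reduce_group(aggregate_key, key_lists):
    """Process the key lists of all map data that share one aggregate key."""
    # Largest number of distinct keys in any single key list of the group.
    max_keys = max(len(set(key_list)) for key_list in key_lists)
    # Merge every key list of the group into one duplicate-free, sorted list.
    merged = sorted(set().union(*key_lists))
    if len(merged) > max_keys:
        # The merge revealed keys that no single list held together:
        # emit one new map datum per key, each carrying the merged list.
        return [(key, list(merged)) for key in merged], True
    # Assumption: when nothing new was learned, keep one map datum as is.
    return [(aggregate_key, merged)], False


grown, increased = reduce_group(
    "Y=0101", [["X=01", "Y=0101"], ["Y=0101", "Z=010101"]]
)
```

In this example the two key lists each hold two keys but merge into three, so three new map data are produced, one per key, and the second return value signals that the number of keys increased.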

The processing method described in this specification is a method in which a computer that aggregates a plurality of data classified by a plurality of key types executes: a step of generating, for each piece of data, map data that associates a key list, which is a list collecting the keys related to that data, with an aggregate key, which is one of those keys, and storing the map data in a storage unit; a step of referring to the storage unit to acquire the key lists of a group of map data associated with the same aggregate key, and comparing the largest number of keys contained in any one of the acquired key lists with the total number of keys obtained by merging all of the acquired key lists; and a step of, when the comparison in the comparing step shows that the total number of keys is larger, generating new map data, each having one of the keys contained in the merged key list as its aggregate key and the merged key list as its key list, in a number equal to the number of keys contained in the merged key list, and storing the new map data in the storage unit.

The processing apparatus described in this specification is a processing apparatus that aggregates a plurality of data classified by a plurality of key types, and includes: a storage unit that stores the plurality of data and map data generated from the plurality of data; a first generation/storage unit that generates, for each piece of data, map data that associates a key list, which is a list collecting the keys related to that data, with an aggregate key, which is one of those keys, and stores the map data in the storage unit; a comparison unit that refers to the storage unit to acquire the key lists of a group of map data associated with the same aggregate key, and compares the largest number of keys contained in any one of the acquired key lists with the total number of keys obtained by merging all of the acquired key lists; and a second generation/storage unit that, when the comparison shows that the total number of keys is larger, generates new map data, each having one of the keys contained in the merged key list as its aggregate key and the merged key list as its key list, in a number equal to the number of keys contained in the merged key list, and stores the new map data in the storage unit.

The processing program, processing method, and processing apparatus described in this specification have the effect of preventing data from being left out of the aggregation and improving performance and scalability when a plurality of data classified by a plurality of key types is aggregated.

FIG. 1 schematically shows the configuration of a distributed processing system according to an embodiment.
FIG. 2 shows the hardware configuration of a processing server.
FIG. 3(a) is a functional block diagram of the processing server, and FIG. 3(b) is a functional block diagram of the management server.
FIG. 4 is a diagram for explaining the basic content of MapReduce processing.
FIG. 5 is a flowchart showing the overall flow of MapReduce processing.
FIG. 6(a) shows an example of data to be processed, and FIGS. 6(b) and 6(c) show the Map data generated from the data of FIG. 6(a).
FIG. 7 shows an example of data to be aggregated.
FIGS. 8(a) and 8(b) show the two Map data generated from the data in FIG. 7 whose original key is Y = 0101 and whose related keys are X = 01 and Y = 0101.
FIG. 9 shows the first aggregation (part 1).
FIG. 10 is a flowchart showing the specific Reduce processing executed in parallel on each processing server in step S14 of FIG. 5.
FIG. 11 shows the first aggregation (part 2).
FIG. 12 shows the first aggregation (part 3).
FIG. 13 shows the first aggregation (part 4).
FIG. 14 shows the second aggregation (part 1).
FIG. 15 shows the second aggregation (part 2).
FIG. 16 shows the second aggregation (part 3-1).
FIG. 17 shows the second aggregation (part 3-2).
FIG. 18 shows the third aggregation.
FIG. 19 shows the fourth aggregation.
FIG. 20 shows the fifth aggregation.
FIG. 21(a) shows the aggregation state of the data when the determination in step S16 of FIG. 5 is affirmative, and FIG. 21(b) shows the aggregation state of the data after steps S18 and S220 have been performed.
FIG. 22 is a diagram for explaining a conventional example (part 1).
FIG. 23 is a diagram for explaining a conventional example (part 2).

An embodiment is described in detail below with reference to FIGS. 1 to 21. FIG. 1 schematically shows the configuration of a distributed processing system 100. The distributed processing system 100 of this embodiment performs "multi-key aggregation processing", which aggregates data classified by a plurality of key types, and applies the MapReduce algorithm to that processing. Here, aggregation means acquiring and collecting data that has the same key.

As shown in FIG. 1, the distributed processing system 100 includes n processing servers 10, which serve as processing apparatuses (computers) that execute processing, and a management server 20 that manages the processing of each processing server 10. The processing servers 10 and the management server 20 are connected to a network 30 such as a LAN (Local Area Network) or the Internet.

FIG. 2 shows the hardware configuration of the processing server 10. As shown in FIG. 2, the processing server 10 includes a CPU 90, a ROM 92, a RAM 94, a storage unit (here, an HDD (Hard Disk Drive)) 96, a network interface 97, a portable storage medium drive 99, and so on. These components of the processing server 10 are connected to a bus 98. In the processing server 10, the functions of the units shown in FIG. 3(a) are realized by the CPU 90 executing a program (the processing program) stored in the ROM 92 or the HDD 96, or a program (the processing program) read from a portable storage medium 91 by the portable storage medium drive 99. The management server 20 has the same configuration as the processing server 10. In the management server 20, the functions of the units shown in FIG. 3(b) are realized by its CPU executing a program (the management program) stored in its ROM or HDD, or read from a portable storage medium by its portable storage medium drive. The functions shown in FIGS. 3(a) and 3(b) are described in detail later.

Returning to FIG. 1, the disk (HDD 96) of each processing server 10 is incorporated into a distributed file system 40, a storage unit that appears virtually as a single disk. For convenience of illustration, FIG. 1 shows the HDDs 96 outside the processing servers 10.

FIG. 3(a) shows a functional block diagram of the processing server 10. By the CPU 90 executing the processing program, the processing server 10 realizes the functions shown in FIG. 3(a): a Map processing unit 12 serving as the first generation/storage unit, and a Reduce processing unit 14 serving as the second generation/storage unit and the comparison unit.

The Map processing unit 12 executes Map processing, described later, using the data stored in the distributed file system 40. The Reduce processing unit 14 performs Reduce processing, described later, using the data produced by Map processing in the Map processing unit 12 (Map data).

FIG. 3(b) shows a functional block diagram of the management server 20. By its CPU executing the management program, the management server 20 realizes the functions shown in FIG. 3(b): a process management unit 22, a key number increase flag management unit 24, and a communication unit 26. In this embodiment, the processing program of each processing server 10 is started in conjunction with the execution of the management program on the management server 20.

The process management unit 22 keeps track of the state of each processing server 10 via the communication unit 26 and, referring to the key number increase flag (described later) managed by the key number increase flag management unit 24, causes each processing server 10 to execute processing.

The key number increase flag management unit 24 manages the value of the key number increase flag based on information acquired from each processing server 10 via the communication unit 26. The key number increase flag takes the value "0" or "1". Only the key number increase flag management unit 24 can refer to the value of the flag. Each processing server 10 therefore does not refer to the counter value; it only issues, remotely and asynchronously, an instruction to the key number increase flag management unit 24 to change the flag value. As a result, the processing program is never made to wait because of the counter.
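The write-only, asynchronous use of the flag can be pictured with the following minimal Python sketch. It is an in-process analogy under stated assumptions, not the Hadoop counter mechanism that the embodiment relies on: the class name `KeyIncreaseFlagManager` and its methods are hypothetical, and a `queue.Queue` stands in for the remote, asynchronous channel between the processing servers and the management server.

```python
import queue


class KeyIncreaseFlagManager:
    """Hypothetical stand-in for the key number increase flag management unit."""

    def __init__(self):
        self._requests = queue.Queue()  # pending "set flag" requests
        self._flag = 0

    def request_set(self):
        # Called by a processing server: enqueue and return immediately,
        # without ever reading the current flag value.
        self._requests.put_nowait(1)

    def value(self):
        # Called only on the management side: apply pending requests first.
        try:
            while True:
                self._requests.get_nowait()
                self._flag = 1
        except queue.Empty:
            pass
        return self._flag

    def reset(self):
        # Discard pending requests and initialize the flag to 0.
        self.value()
        self._flag = 0


manager = KeyIncreaseFlagManager()
manager.request_set()  # a server reports that a key count increased
manager.request_set()  # repeated reports are harmless: the flag stays 1
```

Because the servers only post requests and never read the flag, no server blocks on shared state, which is the property the paragraph above emphasizes.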

The communication unit 26 communicates with the processing servers 10 and passes the results of that communication to the process management unit 22 and the key number increase flag management unit 24. The communication unit 26 also transmits instructions from the process management unit 22 to the processing servers 10.

The configuration described above is the standard configuration when Apache Hadoop is used as the distributed processing middleware. The distributed file system 40, the starting, stopping, and control of the processing program, and the counter function are therefore implemented as basic Hadoop functions.

Next, the basic content of MapReduce processing is described with reference to FIG. 4. In the processing flow, all of the data to be processed is assumed to be stored initially in storage on the management server 20. When the management server 20 uploads this data to the distributed file system 40, the data is automatically divided and stored in the storage of the processing servers 10 in even amounts. MapReduce processing is executed after that. FIG. 4 shows the basic concept of MapReduce processing.

In MapReduce processing, two kinds of processing are performed on the processing servers 10 in a distributed, parallel manner: processing that divides the data to be processed on the distributed file system 40 into Map data consisting of a primary key (also called an "aggregate key" because it is the key used for aggregation) and a value (Map processing), and processing that collects Map data according to the value of the primary key (Reduce processing). The format of the primary key and the value in Map processing is free, and a Map process may also generate no Map data at all. For example, suppose that Map processing performed in a distributed, parallel manner on the processing servers 10 has generated Map data such as that shown in the upper part of FIG. 4.

Reduce processing, on the other hand, takes the Map data generated by Map processing as its source data, distributes it among the processing servers according to the value of its primary key, and performs some processing on groups of Map data formed according to some criterion. In general, Map data having the same primary key value is grouped together, and processing is performed on each such group.

FIG. 4 shows the most basic method of distributing Map data among processing servers. Specifically, each processing server 10 or the management server 20 computes, by a known calculation method, a hash value that is uniquely determined by the primary key value of each Map data, and takes the remainder (0 to 2) of dividing that hash value by the number of processing servers (3 in FIG. 4). If the remainder value corresponding to each processing server 10 is fixed in advance as the number (0 to 2) shown in the lower part of FIG. 4, the processing server 10 that processes each Map data can be determined. Because the same primary key value always yields the same hash value, all Map data having the same primary key value is collected on one processing server. Provided the hash values are unbiased, the processing of the Map data can also be spread evenly across the servers. Determining the processing server solely from the hash value in this way is the simplest example. The processing server may be determined in a more sophisticated way, for example by taking into account the load on each processing server at that time in addition to the hash value. In general, there are more distinct primary key values than processing servers 10, so each processing server 10 processes multiple Map data groups. Each processing server 10 processes multiple Map data groups simultaneously using multiple processes or multiple threads.
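The simplest distribution rule described above (hash value modulo the number of servers) can be sketched in Python as follows. The function name `assign_server` is hypothetical, and CRC32 from the standard `zlib` module is only an assumed stand-in for the unspecified "known calculation method"; Python's built-in `hash` is avoided because it is randomized per process for strings.

```python
import zlib


def assign_server(primary_key: str, num_servers: int) -> int:
    """Return the index (0 .. num_servers - 1) of the server for this key."""
    # CRC32 is an assumed stand-in for the unspecified hash function.
    hash_value = zlib.crc32(primary_key.encode("utf-8"))
    # The remainder of dividing by the server count selects the server.
    return hash_value % num_servers


# Map data with the same primary key always lands on the same server.
server_for_ccc = assign_server("CCC", 3)
```

Since the result depends only on the key and the server count, any server can compute the destination of a Map datum locally, without consulting shared state.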

Each processing server 10 (Reduce processing unit 14) refers to the primary key values of the Map data collected on it. It then treats the Map data having the same primary key value (a Map data group) as one group and performs Reduce processing on that group (see the group with primary key = CCC in the lower part of FIG. 4). Because the management server 20 (process management unit 22) keeps track of the state of each processing server 10, it also knows whether Map processing and Reduce processing on each processing server 10 have completed. The management server 20 can therefore repeat MapReduce processing, for example by having each processing server 10 execute Map processing again in response to the result of Reduce processing.

Next, the multi-key aggregation processing in the distributed processing system 100 of this embodiment is described in detail. Multi-key aggregation processing here means aggregation processing for the case where there are multiple types of keys (key types) identifying one continuous business flow, as shown in FIG. 23(a).

FIG. 5 is a flowchart showing the specific flow of the multi-key aggregation processing. The processing of FIG. 5 starts after the process management unit 22 of the management server 20 has uploaded the original data to be processed to the distributed file system 40. In the multi-key aggregation processing of FIG. 5, Map processing and Reduce processing are repeated as many times as necessary. Repeating MapReduce processing advances the grouping of the data, and the final result is a correct grouping of the data linked by multiple keys. In the following, one round of MapReduce processing is called an "aggregation".

In the processing of FIG. 5, first, in step S10, the key number increase flag management unit 24 of the management server 20 initializes the key number increase flag to 0.

Next, in step S12, the Map processing unit 12 of each processing server 10 executes Map processing, which generates Map data from the original data. Map processing on each processing server 10 is started by a Hadoop function. In the Map processing of each server, Map data with the structure shown in FIGS. 6(b) and 6(c) is generated from the original data stored on that server (for example, FIG. 6(a)) out of the data to be processed that was uploaded to the distributed file system 40. In Hadoop, a file uploaded to the distributed file system 40 is generally copied so that data with the same content is held on multiple servers for redundancy. To simplify the explanation, however, data with the same content is assumed here to exist in only one place. The data in FIG. 6(a) is data associated by multiple keys, so each record always has a primary key and has zero or more related keys. Which item in a record is the primary key and which items are related keys is taken to be self-evident and is defined in advance. For example, the key that comes first lexicographically among the keys (the first key when ordered X → Y → Z and then by ascending numeric value) is taken to be the primary key. This definition resides on the management server 20 and is passed to each processing server 10 as a parameter when Map processing is started. Map data has a "primary key" and a "value", and the "value" holds a key list and an event list. Here, an event refers to the input data itself.
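The example rule for choosing the primary key (key types ordered X → Y → Z, then ascending numeric value) can be sketched in Python as follows. The representation of a key as a `(key_type, value)` tuple and the names `KEY_TYPE_ORDER`, `sort_key`, and `select_primary_key` are assumptions made for illustration; the specification only states the ordering rule as one example of a predefined definition.

```python
# Assumed ranking of key types, following the X -> Y -> Z order in the text.
KEY_TYPE_ORDER = {"X": 0, "Y": 1, "Z": 2}


def sort_key(key):
    """Order keys by key type first, then by ascending numeric value."""
    key_type, value = key
    return (KEY_TYPE_ORDER[key_type], int(value))


def select_primary_key(keys):
    """Return the key that comes first under the example ordering rule."""
    return min(keys, key=sort_key)


# Keys of a record like the one in FIG. 6(a): Y = 0101 and Z = 010101.
primary = select_primary_key([("Z", "010101"), ("Y", "0101")])
```

With these inputs the Y key is selected, because Y precedes Z in the assumed key type order.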

During Map processing, the Map processing unit 12 reads one record of the data to be processed, lists the item names and values of its primary key and related keys, and creates a key list. FIG. 6(a) is an example of processing one record of the slip sub-detail table: "slip sub-detail number (Y)" is the primary key, with the value "0101", and the related key is "slip detail number (Z)", with the value "010101". The key list therefore consists of two keys.

The Map processing unit 12 also creates as many Map data as there are elements (keys) in the key list. In the example of FIG. 6(a) the key list contains two keys, so the two Map data shown in FIGS. 6(b) and 6(c) are generated. Each element of the key list is used as the primary key of one of the generated Map data. The primary key of the first Map data is therefore "0101" of "slip sub-detail number (Y)", and the primary key of the second Map data is "010101" of "slip detail number (Z)".

In addition, the Map processing unit 12 evaluates the key values of the created Map data lexicographically and stores the content of the input record (the data content) as one element only in the event list of the Map data that comes first. At the first aggregation, the event list therefore holds at most one element. For Map data whose key value is not the first, the event list is left empty (see FIG. 6(c)). If the original input data contains superfluous data that is not used in later processing, defining in advance which items are unnecessary makes it possible to filter them out when storing the event list.
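The Map processing described above (build the key list, emit one Map datum per key, and attach the input record only to the lexicographically first one) can be sketched in Python as follows. The dictionary layout, the function name `map_record`, and the `key_fields` parameter are hypothetical; comparing values as strings assumes zero-padded values of equal width within a key type.

```python
def map_record(record, key_fields):
    """Generate the Map data for one input record.

    key_fields names the items that act as primary or related keys
    (assumed to be defined in advance, as the text describes).
    """
    # Key list: (item name, value) pairs, sorted so the first key leads.
    key_list = sorted(
        (name, record[name]) for name in key_fields if name in record
    )
    first_key = key_list[0]
    map_data = []
    for key in key_list:
        map_data.append({
            "primary_key": key,
            "key_list": list(key_list),
            # Only the Map datum with the first key carries the event.
            "event_list": [dict(record)] if key == first_key else [],
        })
    return map_data


# A record shaped like FIG. 6(a): primary key Y = 0101, related key Z = 010101.
result = map_record({"Y": "0101", "Z": "010101"}, ["Y", "Z"])
```

Storing the event in only one Map datum avoids duplicating the record content across every key, while the shared key list still lets each Map datum reach the others in later aggregations.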

ここで、図７には、本実施形態における複数の処理対象データの初期状態の一例が示されている。この図７の例では、太線矩形枠がイベントデータを示し、破線矩形枠がキーリストを示している。また、１つのイベントが入力データ１レコードに対応し、各イベント下部のキーリストは、そのレコードに含まれる主キーおよび関連キーを示している。本実施形態ではキーリストにおいて主キーと関連キーの区別をする必要はないが、イベントの判別のため、入力データの主キーの値を「元キー」として表記している。また、図７において各データを結ぶ破線は、関連キーと主キーの関係を示している。入力データは３層からなり、１番下の層から２番目の層、２番目から最上位の層へ関連が設定されている。図７では、最終的に２つのグループ（一点鎖線で分割されたグループ）にグループ化されるのが正解である。また、図７では、左側のグループのＺ＝０１０１０１等において、下層の１種類のキーが、上層の複数のキーに関連付いているケースが含まれている。このため、単純に最下層から上の層へ向かって集約を繰り返す方法で集約させる方法は適用できない。 FIG. 7 shows an example of the initial state of a plurality of data items to be processed in the present embodiment. In FIG. 7, thick-line rectangles indicate event data and broken-line rectangles indicate key lists. One event corresponds to one record of input data, and the key list below each event shows the primary key and related keys included in that record. In the present embodiment the key list does not need to distinguish the primary key from the related keys, but the primary key value of the input data is labeled as the “original key” so that events can be told apart. The broken lines connecting the data in FIG. 7 indicate the relationships between related keys and primary keys. The input data consists of three layers, with relationships set from the bottom layer to the second layer and from the second layer to the top layer. In FIG. 7, the correct result is a final grouping into two groups (the groups separated by the one-dot chain line). FIG. 7 also includes cases, such as Z = 010101 in the left group, in which one lower-layer key is associated with a plurality of upper-layer keys. For this reason, a method that simply repeats aggregation from the lowest layer upward cannot be applied.

この図７に示す各処理対象のデータに関しても、Ｍａｐ処理部１２によって、上述したようなＭａｐ処理が行われる。例えば、元キーがＹ＝０１０１で、関連キーがＸ＝０１，Ｙ＝０１０１のデータの場合、図８（ａ）、図８（ｂ）に示すような２つのＭａｐデータが生成されることになる。これら２つのＭａｐデータは、図９において、矢印（→）で示すＭａｐデータである。なお、図７のその他のデータについても同様にＭａｐ処理が行われ、図９に示すような多数のＭａｐデータが生成されることになる。なお、図９において、各Ｍａｐデータの上部に記載されているキーは、主キーを意味する。 The Map processing described above is also performed by the Map processing unit 12 on each data item to be processed shown in FIG. 7. For example, for the data whose original key is Y = 0101 and whose related keys are X = 01 and Y = 0101, two Map data items are generated, as shown in FIGS. 8A and 8B. These are the two Map data items marked with arrows (→) in FIG. 9. The Map process is performed on the other data in FIG. 7 in the same way, producing the many Map data items shown in FIG. 9. In FIG. 9, the key written above each Map data item is its primary key.

Ｍａｐ処理完了後、生成されたＭａｐデータは図４で説明したように各処理サーバに再分配される。Ｍａｐ処理の完了検知、Ｍａｐデータの振分け、Ｒｅｄｕｃｅ処理の起動はHadoopの機能で実行される。 After the completion of the Map process, the generated Map data is redistributed to each processing server as described with reference to FIG. The detection of the completion of the Map process, the distribution of the Map data, and the start of the Reduce process are executed by the Hadoop function.

図５に戻り、ステップＳ１４では、Ｒｅｄｕｃｅ処理部１４が、主キー毎にデータ群を集約し、Ｒｅｄｕｃｅ処理を実行する。このステップＳ１４では、具体的には、各処理サーバ１０で並列的に図１０のフローチャートに沿った処理が実行される。なお、Ｒｅｄｕｃｅ処理は、主キーの値が一致するＭａｐデータ群に対して実行される。ここでは、その処理単位を「集約グループ」または「グループ」と呼ぶことにする。 Returning to FIG. 5, in step S14 the Reduce processing unit 14 aggregates the data group for each primary key and executes the Reduce process. Specifically, in step S14 the processing servers 10 execute the processing of the flowchart of FIG. 10 in parallel. The Reduce process is executed on each group of Map data whose primary key values match. This unit of processing is referred to here as an “aggregation group” or simply a “group”.

図１０の処理では、ステップＳ３０において、Ｒｅｄｕｃｅ処理部１４が、各グループ（主キーがＸ＝０１〜Ｚ＝０１０１０６）のイベントリスト、キーリスト、及び最大入力キー数を初期化する。この場合、最大入力キー数の初期値は０である。なお、図１１は、図９のＭａｐデータをグループ毎に纏めて示す図である。これ以降、ステップＳ３２からステップＳ４０においては、各Ｍａｐデータを対象とするループ処理が行われる。 In the process of FIG. 10, in step S30, the Reduce processing unit 14 initializes an event list, a key list, and the maximum number of input keys of each group (primary keys are X = 01 to Z = 010106). In this case, the initial value of the maximum number of input keys is 0. FIG. 11 is a diagram showing the Map data of FIG. 9 collectively for each group. Thereafter, in steps S32 to S40, a loop process for each Map data is performed.

ステップＳ３２では、Ｒｅｄｕｃｅ処理部１４が、処理対象のＭａｐデータを取得する。この場合、各処理サーバ１０のＲｅｄｕｃｅ処理部１４は、図１１で示される集約グループ（主キーの値が一致するデータ群）の中から１つのＭａｐデータを取得することになる。 In step S32, the Reduce processing unit 14 acquires map data to be processed. In this case, the Reduce processing unit 14 of each processing server 10 acquires one Map data from the aggregation group (data group having the same primary key value) shown in FIG.

次いで、ステップＳ３４では、Ｒｅｄｕｃｅ処理部１４が、Ｍａｐデータのキーリストの要素数（キー数）が、これまでに取得した同一グループ内の最大入力キー数より大きければ、そのキー数を最大入力キー数に設定する。すなわち、Ｒｅｄｕｃｅ処理部１４は、取得したＭａｐデータ（群）に対し、そのＭａｐデータのキーリストの要素数（キー数）をカウントし、グループの最大入力キー数より大きければ、グループの最大入力キー数をその値で更新する。したがって、集約グループの全てのＭａｐデータに対してステップＳ３４を行った場合には、最終的に、グループの最大入力キー数は、入力されたＭａｐデータ群のキーリストのうち、最大のもののキー数と一致することになる。例えば、図１１の主キーがＸ＝０１のグループでは、最大入力キー数は、２つのキーリストのうち最大のキー数である「２」となる。 Next, in step S34, if the number of elements (number of keys) in the key list of the Map data is larger than the maximum number of input keys obtained so far within the same group, the Reduce processing unit 14 sets the maximum number of input keys to that number. That is, for each acquired Map data item, the Reduce processing unit 14 counts the number of elements (keys) in its key list and, if the count exceeds the group's maximum number of input keys, updates the group's maximum with that value. Consequently, once step S34 has been performed for all Map data of the aggregation group, the group's maximum number of input keys equals the number of keys in the largest key list among the input Map data. For example, in the group whose primary key is X = 01 in FIG. 11, the maximum number of input keys is “2”, the larger of the key counts of the two key lists.

次いで、ステップＳ３６では、Ｒｅｄｕｃｅ処理部１４が、Ｍａｐデータのキーリストの内容をグループのキーリスト（図１２の太破線枠参照）にコピーする。ただし、グループのキーリストに同じキーがすでに含まれている場合、そのキー（重複するキー）については、コピーせず、破棄するものとする。したがって、集約グループの全てのＭａｐデータに対してステップＳ３６を行った場合には、最終的に、各キーリストをマージ（集計）したグループのキーリスト（図１２の太破線枠）が生成されることになる。 Next, in step S36, the Reduce processing unit 14 copies the contents of the key list of the Map data into the group's key list (see the thick broken-line frame in FIG. 12). If the group's key list already contains the same key, that duplicate key is discarded rather than copied. Consequently, once step S36 has been performed for all Map data of the aggregation group, the group's key list (the thick broken-line frame in FIG. 12) is the merged result of all the individual key lists.

次いで、ステップＳ３８では、Ｒｅｄｕｃｅ処理部１４が、Ｍａｐデータにイベントが含まれていた場合に、そのイベントをイベントリストに追加（コピー）する。イベントに関しては、同一の内容のものが複数存在することは無い（仮に同一の内容だったとしても別のレコードに起因する別データである）。したがって、Ｒｅｄｕｃｅ処理部１４は、内容のチェックを特に行うことなく、イベントリストにイベントをコピーすることができる。 Next, in step S38, if the event is included in the Map data, the Reduce processing unit 14 adds (copies) the event to the event list. Regarding events, there are never a plurality of items having the same contents (even if the contents are the same, they are different data caused by different records). Therefore, the Reduce processing unit 14 can copy the event to the event list without particularly checking the contents.

次いで、ステップＳ４０では、Ｒｅｄｕｃｅ処理部１４が、未処理のＭａｐデータが存在するか否かを判断する。ここでの判断が肯定された場合には、ステップＳ３２に戻る。一方、ステップＳ４０の判断が否定された場合には、ステップＳ４２に移行する。ステップＳ４０の判断が否定された段階では、全てのＭａｐデータの処理が終了しており、各グループの最大入力キー数と、各グループのマージしたキーリストの要素数（キー数）とが求められている。なお、図１２において、各グループの右上に記載されている数値（ａ→ｂ）は、「ａ」が各グループの最大入力キー数を意味し、「ｂ」が各グループのマージしたキーリストの要素数（キー数）を意味する。 Next, in step S40, the Reduce processing unit 14 determines whether any unprocessed Map data remains. If so, the process returns to step S32. If not, the process proceeds to step S42. When the determination in step S40 is negative, all Map data have been processed, and both the maximum number of input keys and the number of elements (keys) in the merged key list have been obtained for each group. In FIG. 12, in the notation (a→b) at the upper right of each group, “a” is the group's maximum number of input keys and “b” is the number of elements (keys) in the group's merged key list.
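
The loop of steps S30 to S40 can be sketched as a single pass over one aggregation group. This is an illustrative sketch only: the function name `accumulate_group`, the representation of each Map data item as a `(key list, event list)` pair, and the sample group are assumptions, not taken from the figures.

```python
def accumulate_group(group):
    """One pass over the Map data of a single aggregation group (steps S30 to S40).

    `group` is an iterable of (key list, event list) pairs that share a primary key.
    Returns (a, merged key list, event list), where a is the maximum number of
    input keys; b in the (a -> b) notation is len(merged key list).
    """
    max_input_keys = 0          # S30: initialise
    merged_keys = []
    events = []
    for key_list, item_events in group:                        # S32: next Map data
        max_input_keys = max(max_input_keys, len(key_list))    # S34
        for key in key_list:                                   # S36: merge, skipping duplicates
            if key not in merged_keys:
                merged_keys.append(key)
        events.extend(item_events)                             # S38: no duplicate check needed
    return max_input_keys, merged_keys, events                 # S40: nothing left to process


# Hypothetical group for primary key Y = 0101, fed by two different records.
X01, Y0101, Z010101 = ("X", "01"), ("Y", "0101"), ("Z", "010101")
a, merged, events = accumulate_group([
    ((X01, Y0101), []),
    ((Y0101, Z010101), ["detail record"]),
])
print(a, len(merged))  # a and b of the (a -> b) notation
```

For this hypothetical group the largest input key list has two keys while the merged key list has three, which is the situation that step S42 detects.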

ステップＳ４０の判断が否定されてステップＳ４２に移行すると、Ｒｅｄｕｃｅ処理部１４は、グループのマージしたキーリストのキー数（ｂ）が、最大入力キー数（ａ）より大きいか否かを判断する。なお、この比較では、最大入力キー数とマージしたキーリストの要素数（キー数）が一致するか、マージしたキーリストの要素数（キー数）の方が大きくなるかのいずれかになる。 If the determination in step S40 is negative and the process proceeds to step S42, the Reduce processing unit 14 determines whether the number of keys (b) in the merged key list of the group is greater than the maximum number of input keys (a). In this comparison, either the maximum number of input keys matches the number of elements in the merged key list (number of keys) or the number of elements in the merged key list (number of keys) becomes larger.

ステップＳ４２の判断が肯定された場合（マージしたキーリストのキー数が、最大入力キー数より大きい場合）には、ステップＳ４４に移行する。なお、図１２では、主キーがＸ＝０３，０５、Ｙ＝０１０１，０１０２，０１０３，０１０４，０１０５、Ｚ＝０１０１０１のグループのいずれかの処理の場合にステップＳ４２の判断が肯定される。このように、マージしたキーリストのキー数の方が多い場合、集約によって新たなキー間の関連が判明したことを意味するので、新しいキー間の関連を利用した再度の集約が必要になる。 If the determination in step S42 is affirmative (the number of keys in the merged key list is larger than the maximum number of input keys), the process proceeds to step S44. In FIG. 12, the determination in step S42 is affirmative when processing any of the groups whose primary key is X = 03, 05, Y = 0101, 0102, 0103, 0104, 0105, or Z = 010101. When the merged key list has more keys, the aggregation has revealed a new relationship between keys, so another aggregation that uses the new relationship is required.

したがって、マージしたキーリストのキー数の方が多かった場合には、ステップＳ４４において、Ｒｅｄｕｃｅ処理部１４が、管理サーバ２０に通知し、キー数増加フラグ管理部２４が、キー数増加フラグを１に設定する。キー数増加フラグが１に設定された場合、管理サーバ２０の処理管理部２２が必ず再度の集約を実行するので、処理サーバ１０から再集約を指示したのと同じ効果になる。 Therefore, if the merged key list has more keys, in step S44 the Reduce processing unit 14 notifies the management server 20, and the key number increase flag management unit 24 sets the key number increase flag to 1. When the key number increase flag is set to 1, the process management unit 22 of the management server 20 always executes another aggregation, which has the same effect as the processing server 10 requesting re-aggregation.

次いで、ステップＳ４６では、Ｒｅｄｕｃｅ処理部１４は、ａ＜ｂのグループについて、キーリストの要素数（キー数）分のＭａｐデータを作成する。ここで、各Ｍａｐデータのキーは、マージされたキーリストに含まれる各キーになる。また、各Ｍａｐデータのキーリストには、グループのキーリストをそのまま格納する。また、Ｍａｐデータのうち、キーの値が辞書式に先頭になるＭａｐデータ（Ｘ→Ｙ→Ｚの順、かつ数値の小さい順に先頭となるＭａｐデータ）についてのみ、イベントリストにグループのイベントリストをそのまま格納する。すなわち、Ｒｅｄｕｃｅ処理部１４は、マージしたキーリストに含まれるキーのうち辞書式に先頭となるキーを主キーとするマップデータを特定し、当該特定されたマップデータのイベントリストにのみデータを関連付ける。 Next, in step S46, for a group with a < b, the Reduce processing unit 14 creates as many Map data items as there are elements (keys) in the key list. The key of each Map data item is one of the keys in the merged key list. The group's key list is stored as is in the key list of each Map data item. Among these Map data items, only the one whose key value comes first lexicographically (first in the order X→Y→Z and then in ascending numeric order) receives the group's event list, stored as is, in its event list. In other words, the Reduce processing unit 14 identifies the Map data whose primary key is the lexicographically first key in the merged key list, and associates the data only with the event list of that Map data.

そして、ステップＳ４８では、Ｒｅｄｕｃｅ処理部１４が、Ｍａｐデータ群の中間生成データとして、ステップＳ４６で作成されたＭａｐデータを、分散ファイルシステム４０上の中間ファイルに出力する。図１３は、図１０の全処理が終了した直後（１回目の集約処理が終了した直後）の状態を示す図である。この図１３では、一点鎖線で囲まれたＭａｐデータ（のいずれか）が中間ファイルに出力されることになる。なお、この処理は、マージしたキーリストにおいて新たに関連が確定したキーを主キーとするＭａｐデータのグループに対して、マージしたキーリストをコピー配布する処理であるといえる。その後は、図１０の全処理を終了し、図５のステップＳ１６に移行する。 In step S48, the Reduce processing unit 14 outputs the Map data created in step S46 to an intermediate file on the distributed file system 40 as intermediate data of the Map data group. FIG. 13 shows the state immediately after all the processing of FIG. 10 has finished (immediately after the first aggregation). In FIG. 13, (any of) the Map data enclosed by a one-dot chain line is output to the intermediate file. This processing can be regarded as copying and distributing the merged key list to the groups of Map data whose primary keys are the keys newly found to be related in the merged key list. The processing of FIG. 10 then ends, and the process proceeds to step S16 in FIG. 5.

なお、出力したＭａｐデータは、再度集約が実行された際、Ｍａｐ処理によって中間ファイルからＭａｐデータとして復元され、次のＲｅｄｕｃｅ処理でキー値に対応した処理サーバ１０に再配布される。従って、２回目以降の集約におけるＭａｐ処理（Ｓ１２）は、入力データファイルからＭａｐデータを生成する初回集約のＭａｐ処理（Ｓ１２）と異なり、中間ファイルにシリアライズされたＭａｐデータをオブジェクトとして復元するだけの処理になる。 The output Map data is restored as Map data from the intermediate file by Map processing when re-aggregation is performed, and is redistributed to the processing server 10 corresponding to the key value in the next Reduce processing. Therefore, the Map process (S12) in the second and subsequent aggregations is different from the Map process (S12) in the first aggregation that generates Map data from the input data file, and only restores the Map data serialized in the intermediate file as an object. It becomes processing.
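
The round trip through the intermediate file can be pictured as plain serialization and restoration. The sketch below uses one JSON document per line purely as an assumption for illustration; the source does not state the file format, and an actual Hadoop job would more likely rely on Hadoop's own serialization. The function names `dump_map_data` and `load_map_data` are hypothetical.

```python
import json


def dump_map_data(map_data):
    """Serialize (primary key, key list, event list) tuples, one JSON document per line."""
    return "\n".join(
        json.dumps({"key": key, "key_list": key_list, "events": events})
        for key, key_list, events in map_data
    )


def load_map_data(text):
    """Restore Map data objects; this is all the second and later Map processes do."""
    restored = []
    for line in text.splitlines():
        doc = json.loads(line)
        restored.append((
            tuple(doc["key"]),
            tuple(tuple(k) for k in doc["key_list"]),  # JSON arrays back to tuples
            tuple(doc["events"]),
        ))
    return restored


X01, Y0101 = ("X", "01"), ("Y", "0101")
intermediate = dump_map_data([(X01, (X01, Y0101), ("order", "slip")), (Y0101, (X01, Y0101), ())])
print(load_map_data(intermediate))
```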

一方、図１０のステップＳ４２の判断が否定された場合には、ステップＳ５０に移行する。ステップＳ５０では、Ｒｅｄｕｃｅ処理部１４が、イベントリストにイベントが存在するか否かを判断する。ここでの判断が肯定された場合には、ステップＳ５２に移行する。なお、ステップＳ５２に移行するのは、図１２のＸ＝０１，０２，０４のグループのいずれかを処理している場合のみである。 On the other hand, if the determination in step S42 of FIG. 10 is negative, the process proceeds to step S50. In step S50, the Reduce processing unit 14 determines whether an event exists in the event list. When judgment here is affirmed, it transfers to step S52. Note that the process proceeds to step S52 only when any of the groups X = 01, 02, 04 in FIG. 12 is processed.

ステップＳ５２では、キーリストのうち辞書的に先頭となるキーを主キーとして、単一のＭａｐデータを生成する。具体的には、Ｍａｐデータを一つだけ生成し、グループのキーリストのうち、辞書式に先頭となる値をキーとし、キーリストにはグループのキーリストをそのまま格納し、イベントリストにはグループのイベントリストをそのまま格納する。そして、ステップＳ４８に移行すると、Ｒｅｄｕｃｅ処理部１４は、Ｍａｐデータ群の中間生成データとして、ステップＳ５２で生成したＭａｐデータを中間ファイルに出力する。なお、ここで複数のＭａｐデータを生成しないのは、キーリストが増えない場合はキー間の新たな関係が判明しておらず、情報が増えないためにすべての集約グループにキーリストを送付しても無駄になるためである。なお、ステップＳ５２にて生成され、ステップＳ４８にて中間ファイルに出力されるＭａｐデータは、図１３において一点鎖線で囲まれていないＭａｐデータ（のいずれか）である。その後は、図１０の全処理を終了し、図５のステップＳ１６に移行する。 In step S52, a single Map data item is generated with the lexicographically first key in the key list as its primary key. Specifically, only one Map data item is generated: its key is the lexicographically first value in the group's key list, its key list stores the group's key list as is, and its event list stores the group's event list as is. The process then moves to step S48, where the Reduce processing unit 14 outputs the Map data generated in step S52 to the intermediate file as intermediate data of the Map data group. Multiple Map data items are not generated here because, when the key list has not grown, no new relationship between keys has been found and no information has been gained, so sending the key list to every aggregation group would be wasted effort. The Map data generated in step S52 and output to the intermediate file in step S48 is (any of) the Map data not enclosed by a one-dot chain line in FIG. 13. The processing of FIG. 10 then ends, and the process proceeds to step S16 in FIG. 5.

これに対し、図１０のステップＳ５０の判断が否定された場合、すなわち、イベントリストにイベントが存在しない場合には、その集約グループでは新たなキーの関連付けは生じなかったことになる。この場合、中間生成データとして何も出力することなく（すなわち、グループを削除して）、図１０の全処理を終了し、図５のステップＳ１６に移行する。ここでステップＳ５０の判断が否定された場合に中間生成データとして何も出力しないこととしているのは、キー数が増えない場合、もしそのキー群に関連したイベントが存在するのであれば、そのイベントが処理されているグループにも同内容のキーリストが送られているはずであるので、そのグループでキーリストを保持すれば良いためである。換言すれば、イベントが存在しないグループではキーリストを捨てても情報が失われることはないためである。 In contrast, if the determination in step S50 of FIG. 10 is negative, that is, if the event list contains no event, no new key association has arisen in that aggregation group. In this case, nothing is output as intermediate data (that is, the group is deleted), the processing of FIG. 10 ends, and the process proceeds to step S16 in FIG. 5. Nothing is output when the determination in step S50 is negative for the following reason: when the number of keys has not increased, if an event related to that set of keys exists, a key list with the same contents must also have been sent to the group in which that event is processed, so it is enough for that group to keep the key list. In other words, in a group with no events, discarding the key list loses no information.
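
The three outcomes of steps S42 to S52 can be summarized in one small function. This is a sketch under assumptions: the name `reduce_output`, the returned `(Map data list, flag)` pair, and the tuple representation of keys are introduced here for illustration, and notifying the management server 20 is reduced to returning a boolean.

```python
def reduce_output(max_input_keys, merged_keys, events):
    """Decide what one aggregation group writes to the intermediate file.

    Returns (list of (primary key, key list, event list), key number increase flag).
    """
    keys = sorted(merged_keys)  # (key type, value) tuples: X -> Y -> Z, then by value
    if len(keys) > max_input_keys:
        # S42 affirmative -> S44 raises the flag, S46 creates one Map data item per key.
        # Only the item for the lexicographically first key carries the events.
        return [(k, keys, events if k == keys[0] else []) for k in keys], True
    if events:
        # S50 affirmative -> S52 creates a single Map data item under the first key.
        return [(keys[0], keys, events)], False
    # S50 negative: the group is dropped; its key list survives elsewhere.
    return [], False


X01, Y0101, Z010101 = ("X", "01"), ("Y", "0101"), ("Z", "010101")
print(reduce_output(2, [Y0101, X01, Z010101], ["detail"]))   # fan-out, flag raised
print(reduce_output(2, [X01, Y0101], ["order", "slip"]))     # single Map data item
print(reduce_output(2, [Y0101, Z010101], []))                # nothing output
```

Returning the flag instead of setting shared state keeps the sketch free of side effects; a driver can simply OR the flags from all groups.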

図５に戻り、各処理サーバ１０による処理（ステップＳ１２、Ｓ１４の処理）が完了した後は、図５のステップＳ１６に移行する。ステップＳ１６では、キー数増加フラグ管理部２４が、キー数増加フラグをチェックし、当該フラグが「１」か否かの判断をする。ここでの判断が肯定された場合、すなわち、キー数増加フラグが「１」の場合、キーの集約が完了していないため、ステップＳ１０に戻り、処理管理部２２の指示の下、各処理サーバ１０が再度同様のMapReduce処理を実行する。 Returning to FIG. 5, after the processing by each processing server 10 (steps S12 and S14) is complete, the process proceeds to step S16. In step S16, the key number increase flag management unit 24 checks the key number increase flag and determines whether it is “1”. If the determination is affirmative, that is, if the key number increase flag is “1”, key aggregation is not yet complete, so the process returns to step S10 and each processing server 10 executes the same MapReduce processing again under the direction of the process management unit 22.

例えば、２回目のステップＳ１２、Ｓ１４の処理の場合（２回目の集約の場合）、図１４に示すように、図１３のＭａｐデータが集約される。なお、図１５は、図１４に最大入力キー数ａ及びマージしたキーリストのキー数ｂを明示した図である。この図１５に示すように、２回目の集約処理では、主キーがＸ＝０３，０５、Ｙ＝０１０１，０１０２，０１０３，０１０４，０１０５、Ｚ＝０１０１０１の８グループが、マージしたキーリストのキー数（ｂ）が最大入力キー数（ａ）より大きくなっている。したがって、これらのグループに関しては、ステップＳ４８において、図１６，図１７で一点鎖線で囲んで示すＭａｐデータが生成され、中間ファイルとして出力される。 For example, in the second execution of steps S12 and S14 (the second aggregation), the Map data of FIG. 13 are aggregated as shown in FIG. 14. FIG. 15 is FIG. 14 annotated with the maximum number of input keys a and the number of keys b in the merged key list. As FIG. 15 shows, in the second aggregation the number of keys in the merged key list (b) exceeds the maximum number of input keys (a) in eight groups, those whose primary keys are X = 03, 05, Y = 0101, 0102, 0103, 0104, 0105, and Z = 010101. For these groups, the Map data enclosed by one-dot chain lines in FIGS. 16 and 17 are generated and output to the intermediate file in step S48.

一方、主キーがＸ＝０１，０２，０４の３グループでは、ステップＳ４２の判断が否定され、ステップＳ５０の判断が肯定される。このため、ステップＳ５２において、図１６で一点鎖線で囲まれていない３つのＭａｐデータが生成され、中間ファイルとして出力される。また、その他の主キーのグループについては、中間ファイルとして出力されない（削除される）。 On the other hand, in the three groups whose primary keys are X = 01, 02, and 04, the determination in step S42 is negative and the determination in step S50 is affirmative. Therefore, in step S52, the three Map data items not enclosed by a one-dot chain line in FIG. 16 are generated and output to the intermediate file. The groups with other primary keys are not output to the intermediate file (they are deleted).

この２回目の集約においても、ステップＳ４２の判断が肯定されることがあるため、キー数増加フラグは「１」に設定される。したがって、３回目の集約も行われることになる（ステップＳ１６が肯定される）。 Even in the second aggregation, the determination in step S42 may be affirmed, so the key number increase flag is set to “1”. Therefore, the third aggregation is also performed (step S16 is affirmed).

３回目の集約では、図１６、図１７のＭａｐデータが、図１８に示すように集約される。この場合、図１８に示すように、主キーがＸ＝０３、Ｙ＝０１０１，０１０２，０１０３、Ｚ＝０１０１０１の５グループが、キーリストの要素数が最大入力キー数より多くなっている（ステップＳ４２の判断が肯定される）。また、主キーがＸ＝０１，０５の２グループでは、ステップＳ４２の判断が否定され、ステップＳ５０の判断が肯定される。 In the third aggregation, the Map data of FIGS. 16 and 17 are aggregated as shown in FIG. 18. In this case, as FIG. 18 shows, in the five groups whose primary keys are X = 03, Y = 0101, 0102, 0103, and Z = 010101, the number of elements in the key list exceeds the maximum number of input keys (the determination in step S42 is affirmative). In the two groups whose primary keys are X = 01 and 05, the determination in step S42 is negative and the determination in step S50 is affirmative.

次いで、４回目の集約では、上述したのと同様に、ステップＳ４４，Ｓ４６、Ｓ４８、Ｓ５２が実行される。これにより、図１９に示すように、Ｍａｐデータが集約される。 Next, in the fourth aggregation, steps S44, S46, S48, and S52 are executed as described above. Thereby, as shown in FIG. 19, Map data is collected.

次いで、５回目の集約では、図１９に示すように、ステップＳ４２の判断が肯定されるグループ（ａ＜ｂのグループ）が存在しない。一方、ステップＳ５０の判断が肯定されるグループは、主キーがＸ＝０１，０５の２グループである。したがって、５回目の集約では、図２０に示すように、２つのグループに集約されることになる。 Next, in the fifth aggregation, as shown in FIG. 19, there is no group (group of a <b) for which the determination in step S42 is affirmed. On the other hand, the groups for which the determination in step S50 is affirmative are two groups whose primary keys are X = 01,05. Therefore, in the fifth aggregation, as shown in FIG. 20, the two groups are aggregated.

この５回目の集約では、上述のようにステップＳ４２が肯定されないため、キー数増加フラグは「０」である。このため、図５のステップＳ１６の判断が否定される。キー数増加フラグが「０」の場合、関連のあるキー群の集約が完了したことを意味するので、キー集約のループ処理（ステップＳ１０〜Ｓ１６）は完了し、図５のステップＳ１８に移行することになる。 In this fifth aggregation, step S42 is never affirmative, as described above, so the key number increase flag is “0”. The determination in step S16 of FIG. 5 is therefore negative. A key number increase flag of “0” means that aggregation of the related keys is complete, so the key aggregation loop (steps S10 to S16) ends and the process proceeds to step S18 in FIG. 5.
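
The overall loop of steps S10 to S16 can be simulated in a single process to see how the key number increase flag ends the iteration. This is a toy model under stated assumptions: the functions `reduce_group` and `aggregate`, the in-memory grouping that stands in for Hadoop's distribution of Map data, and the four-record data set are all hypothetical and much smaller than the example of FIG. 7.

```python
from collections import defaultdict


def reduce_group(group):
    """Steps S30 to S52 for one aggregation group: (output Map data, key count grew?)."""
    max_input_keys, merged, events = 0, [], []
    for key_list, item_events in group:
        max_input_keys = max(max_input_keys, len(key_list))
        for key in key_list:
            if key not in merged:
                merged.append(key)
        events.extend(item_events)
    keys = tuple(sorted(merged))
    if len(keys) > max_input_keys:   # S42 affirmative: fan out, events on first key only
        return [(k, keys, tuple(events) if k == keys[0] else ()) for k in keys], True
    if events:                       # S50 affirmative: keep a single Map data item
        return [(keys[0], keys, tuple(events))], False
    return [], False                 # S50 negative: drop the group


def aggregate(map_data):
    """Repeat grouping and Reduce until no group raises the key number increase flag."""
    rounds = 0
    while True:
        rounds += 1
        groups = defaultdict(list)
        for key, key_list, events in map_data:   # stand-in for distribution by primary key
            groups[key].append((key_list, events))
        map_data, flag = [], False
        for key in sorted(groups):
            out, grew = reduce_group(groups[key])
            map_data.extend(out)
            flag = flag or grew
        if not flag:                              # step S16 negative: loop ends
            return map_data, rounds


X01, X02, Y0101, Z010101 = ("X", "01"), ("X", "02"), ("Y", "0101"), ("Z", "010101")
initial = [  # Map data as the initial Map process would emit it for four records
    (X01, (X01,), ("order",)),
    (X01, (X01, Y0101), ("slip",)),
    (Y0101, (X01, Y0101), ()),
    (Y0101, (Y0101, Z010101), ("detail",)),
    (Z010101, (Y0101, Z010101), ()),
    (X02, (X02,), ("other",)),
]
result, rounds = aggregate(initial)
print(rounds)
for item in result:
    print(item)
```

In this toy data set the first round discovers that X = 01, Y = 0101, and Z = 010101 belong together, and the second round gathers all three events under X = 01 without any key list growing, so the loop stops after two rounds with two groups remaining.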

ステップＳ１８では、Ｍａｐ処理部１２が、最終集約処理のＭａｐ処理を実行する。具体的には、最後集約処理として、Ｍａｐ処理部１２は、ループの最後のＲｅｄｕｃｅ処理において中間ファイルに出力したＭａｐデータ（図２１（ａ））を入力とし、最終のＭａｐ処理を実行する。このＭａｐ処理では、Ｍａｐ処理部１２は、中間ファイルのＭａｐデータを入力とし、各Ｍａｐデータのキーリストをキーとし、イベントリストを値とした新たなＭａｐデータを作成し、また、元のＭａｐデータを破棄する。なお、「Ｍａｐデータのキーリストをキーとする」とは、キーリストの各キーを羅列して新たなキーを作成することを意味する。すなわち、図２１（ａ）に示すようなＭａｐデータからは、図２１（ｂ）に示すようなキー（主キー）で新たなＭａｐデータが生成される。 In step S18, the Map processing unit 12 executes the Map process of the final aggregation. Specifically, as the final aggregation, the Map processing unit 12 takes as input the Map data (FIG. 21A) that was output to the intermediate file by the last Reduce process of the loop and executes the final Map process. In this Map process, the Map processing unit 12 creates, from each Map data item in the intermediate file, new Map data whose key is the key list of that Map data item and whose value is its event list, and discards the original Map data. “Using the key list of the Map data as the key” means creating a new key by concatenating the keys in the key list. That is, from Map data such as that shown in FIG. 21A, new Map data is generated with a key (primary key) such as that shown in FIG. 21B.

次いで、ステップＳ２０では、Ｒｅｄｕｃｅ処理部１４が、ステップＳ１８のＭａｐ処理で作成された新たなＭａｐデータを対象として、最終集約処理のＲｅｄｕｃｅ処理を実行する。Ｒｅｄｕｃｅ処理では、集めたＭａｐデータのイベントリストから全イベントデータを抽出する。これらのイベントデータからイベントの種類やタイムスタンプの情報を抽出し、イベントを時系列に並べれば、各集約グループにつき１つずつのフローインスタンスが完成する。 Next, in Step S20, the Reduce processing unit 14 executes the Reduce process of the final aggregation process for the new Map data created in the Map process of Step S18. In the Reduce process, all event data is extracted from the event list of the collected Map data. If event type and time stamp information is extracted from these event data and the events are arranged in time series, one flow instance is completed for each aggregation group.
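
The final aggregation of steps S18 and S20 can be sketched as one grouping pass. The function name `final_aggregate`, the `"X=01,Y=0101"` key format, and the event fields `type` and `time` are assumptions made for this illustration; the source only states that the keys in the key list are concatenated and that events are ordered by time stamp.

```python
from collections import defaultdict


def final_aggregate(map_data):
    """Final Map (S18) and Reduce (S20): one time-ordered flow instance per key list."""
    flows = defaultdict(list)
    for _primary_key, key_list, events in map_data:
        # S18: the whole key list, written out in order, becomes the new key.
        flow_key = ",".join(f"{key_type}={value}" for key_type, value in key_list)
        # S20: collect every event that arrives under the same new key.
        flows[flow_key].extend(events)
    # Arrange each group's events in time series to obtain the flow instance.
    return {key: sorted(evs, key=lambda e: e["time"]) for key, evs in flows.items()}


X01, Y0101 = ("X", "01"), ("Y", "0101")
flows = final_aggregate([
    (X01, (X01, Y0101), ({"type": "ship", "time": 2},)),
    (Y0101, (X01, Y0101), ({"type": "order", "time": 1},)),
])
for flow_key, events in flows.items():
    print(flow_key, [e["type"] for e in events])
```

Because the new key is derived from the complete key list, Map data items that ended the loop under different primary keys but with the same key list still land in the same group at this stage.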

なお、最終集約で関連するイベントを集約するので、ループ内部の集約においては、イベントの移動は必須ではない。しかし、ループの内部でイベントを移動させ、段階的にイベントを集めた方が、イベントの統計処理等をＲｅｄｕｃｅ処理の内部で段階的に行わせることなどが可能になり、メリットが大きい。 Since related events are aggregated in the final aggregation, movement of events is not essential in the aggregation within the loop. However, moving events within a loop and collecting events in stages makes it possible to perform statistical processing of events in stages within Reduce processing, and so on, which has a great merit.

以上の処理により、図５の全処理が終了すると、図２１（ｂ）のように、複数キーのデータを正しく集計することができる。 When all the processes in FIG. 5 are completed by the above processing, data of a plurality of keys can be correctly totaled as shown in FIG.

なお、ステップＳ１８、Ｓ２０の処理は、図２０のように、ステップＳ１８以前の処理で全てのイベント（データ）を正しく集計できるような場合には行わなくてもよい。ただし、実際には、ステップＳ１８、Ｓ２０の処理を行わないと全てのデータを正しく集計できない可能性もあるので、本実施形態では、ステップＳ１８、Ｓ２０の処理を常に行うこととしている。 Steps S18 and S20 need not be performed when, as in FIG. 20, all events (data) have already been correctly aggregated by the processing before step S18. In practice, however, it is possible that not all data can be aggregated correctly without steps S18 and S20, so in the present embodiment steps S18 and S20 are always performed.

以上詳細に説明したように、本実施形態によると、複数のキー種で分類された複数のデータを集約する処理サーバ１０が、データ毎に関連するキーを集めたリストであるキーリスト、及び当該キーのうちの１つのキーである主キーを関連付けたＭａｐデータを生成して分散ファイルシステム４０に記憶し、分散ファイルシステム４０を参照して、同一の主キーに関連付けられたマップデータのグループのキーリストを取得し、該取得したキーリストに含まれるキー数の中で最大のキー数（最大入力キー数）と、取得した全てのキーリストをマージした合計キー数と、を比較し、比較の結果、合計キー数の方が大きかった場合に、当該マージしたキーリストに含まれるキーそれぞれを主キーとし、キーリストとしてマージしたキーリストを関連付けた新たなＭａｐデータを、マージしたキーリストに含まれるキーの数分だけ生成して分散ファイルシステム４０に記憶する。これにより、マージしたキーリストを、当該マージしたキーリストに含まれるキー（関連することが確定した複数のキー）それぞれを主キーとするＭａｐデータのグループにコピー配布することができる（ステップＳ４６）。このようにすることで、ＲＤＢ（関係データベース）を用いなくとも、Ｍａｐデータ（特にキーリスト）を参照するのみで、関連するキーを集約することができる。したがって、複数のキー種で分類された複数のデータを集約する際に、性能・スケーラビリティ改善効果を得ることが可能となる。また、新たに生成されるＭａｐデータが、各処理サーバにおいてＲｅｄｕｃｅ処理されるという処理が繰り返されるので、データ集約の漏れをなくすことができる。また、本実施形態では、同一の主キーに関連付けられたキーリストのキー数が増える場合に、マージしたキーリストに含まれるキーそれぞれを主キーとして新たなＭａｐデータが生成されるので、同時に分散実行される集約処理数を大きくすることができる。これにより、少ない回数のMapReduce処理の繰り返しで、関連するキーを集約することができる。 As described in detail above, according to the present embodiment, the processing server 10, which aggregates a plurality of data items classified by a plurality of key types, generates Map data that associates a key list (a list of the keys related to each data item) with a primary key (one of those keys) and stores the Map data in the distributed file system 40. Referring to the distributed file system 40, it acquires the key lists of the group of Map data associated with the same primary key, and compares the largest number of keys in any acquired key list (the maximum number of input keys) with the total number of keys obtained by merging all the acquired key lists. If the total number of keys is larger, it generates as many new Map data items as there are keys in the merged key list, each with one of those keys as its primary key and the merged key list as its key list, and stores them in the distributed file system 40. As a result, the merged key list can be copied and distributed to the groups of Map data whose primary keys are the keys included in the merged key list (the keys now known to be related) (step S46). In this way, related keys can be aggregated simply by referring to the Map data (in particular the key lists), without using an RDB (relational database). Performance and scalability can therefore be improved when aggregating a plurality of data items classified by a plurality of key types. In addition, because the newly generated Map data are repeatedly subjected to the Reduce process in each processing server, omissions in data aggregation can be eliminated. Furthermore, in the present embodiment, when the number of keys in the key lists associated with the same primary key increases, new Map data is generated with each key in the merged key list as a primary key, so the number of aggregation processes executed in a distributed manner at the same time can be increased. Related keys can thus be aggregated with a small number of repetitions of the MapReduce processing.

また、本実施形態によると、Ｍａｐデータは、データの内容のリストであるイベントリストを含んでおり、取得したキーリストに含まれるキー数の中で最大のキー数（最大入力キー数）と、取得した全てのキーリストをマージした合計キー数と、を比較した結果、合計キー数の方が大きかった場合に、マージしたキーリストに含まれるキーのうち辞書式に先頭となるキーを主キーとするＭａｐデータを特定し、当該特定されたＭａｐデータのイベントリストにのみデータを関連付ける。これにより、関連するキーの集約及びイベント（データ）の集約を適切に行うことができるので、集約を少ない回数のMapReduce処理の繰り返しで行うことができる。 In addition, according to the present embodiment, the Map data includes an event list that is a list of data contents, and the maximum number of keys (maximum number of input keys) among the number of keys included in the acquired key list, If the total number of keys is greater as a result of comparing the total number of keys merged for all the acquired key lists, the lexicographically leading key in the merged key list is the primary key. Map data is specified, and the data is associated only with the event list of the specified Map data. Accordingly, aggregation of related keys and aggregation of events (data) can be appropriately performed, so that aggregation can be performed by repeating the MapReduce process a small number of times.

また、本実施形態によると、合計キー数の方が最大入力キー数よりも大きくなかった場合（合計キー数と最大入力キー数が一致した場合）であって、取得したイベントリストにデータが含まれる場合には、マージしたキーリストのうち辞書式に先頭となるキーを主キーとし、取得したイベントリストをマージしたイベントリスト及びマージしたキーリストが関連付けられた新たなマップデータを生成する。これにより、本実施形態では、データ（イベント）に関連しないＭａｐデータの必要以上の集約処理の発生を抑制することができる。 According to the present embodiment, when the total number of keys is not larger than the maximum number of input keys (that is, the two are equal) and the acquired event lists contain data, new Map data is generated whose primary key is the lexicographically first key in the merged key list and with which the merged event list (the acquired event lists merged together) and the merged key list are associated. In this way, the present embodiment suppresses unnecessary aggregation processing of Map data that is not related to any data (event).

また、本実施形態によると、分散ファイルシステム４０に記憶されている全ての主キーに関連付けられたグループのキーリストにおいて、合計キー数の方が最大入力キー数よりも大きくなかった場合（合計キー数と最大入力キー数が一致した場合）に、分散ファイルシステム４０に記憶されているキーリストに基づいて一意の値を作成し、当該値を主キーとするＭａｐデータを生成する。これにより、全グループにおいて合計キー数が最大入力キー数よりも大きくなくなるまでの間に全てのデータを正しく集約できなかった場合でも、一意の値を主キーとしてＭａｐデータを生成することで、正しく集約できなかったデータを正しく集約することができる。 According to the present embodiment, when the total number of keys is not larger than the maximum number of input keys (the two are equal) in the key lists of the groups associated with every primary key stored in the distributed file system 40, a unique value is created based on the key list stored in the distributed file system 40, and Map data having that value as its primary key is generated. Thus, even if not all data could be correctly aggregated by the time the total number of keys ceased to exceed the maximum number of input keys in every group, generating Map data with the unique value as the primary key allows the data that had not been correctly aggregated to be aggregated correctly.

ここで、本出願の発明者（及び出願人）は、特願２０１１−０５０７４５号（以下、「先願」と呼ぶ）において、本願と同一の課題を解決するための発明について出願している。本実施形態では、同時に分散実行される集約処理数を大きくすることで、上述したように集約の回数を極力減らすことができる。このため、本実施形態では、特に、多数の処理サーバ１０から構成される処理システムを、あるデータを対象にした単一の集約処理で占有できる場合、あるいは、複数の集約処理を行う場合でも処理システムの性能に余裕がある場合に、最短の処理時間で処理を終えることが可能である。これに対し、先願では、集約の回数は本実施形態に比べて多少多いものの、１回の集約における処理量は少ないので、ある処理システムにおいて複数のデータを対象にした集約処理を同時に実行する場合で、かつ処理システムの性能に余裕が無い場合に、最高のスループット性能を得ることが可能である。すなわち、本実施形態と先願とは、処理システムの規模や、処理システムに対する利用者の数などの条件に応じて適宜使い分けることができる。 Here, the inventor (and the applicant) of the present application has applied for an invention for solving the same problem as in the present application in Japanese Patent Application No. 2011-050745 (hereinafter referred to as “prior application”). In the present embodiment, by increasing the number of aggregation processes that are simultaneously distributed and executed, the number of aggregations can be reduced as much as possible. Therefore, in the present embodiment, in particular, even when a processing system composed of a large number of processing servers 10 can be occupied by a single aggregation process for a certain data, or even when a plurality of aggregation processes are performed. When there is a margin in system performance, processing can be completed in the shortest processing time. On the other hand, in the prior application, although the number of times of aggregation is slightly larger than that of the present embodiment, the amount of processing in one aggregation is small, so that aggregation processing for a plurality of data is simultaneously executed in a certain processing system. In some cases and when there is no margin in processing system performance, it is possible to obtain the highest throughput performance. In other words, the present embodiment and the prior application can be appropriately used according to conditions such as the scale of the processing system and the number of users for the processing system.

Although the above embodiment describes the case where the management server 20 is provided, the present invention is not limited to this, and any one of the processing servers 10 may perform the same processing as the management server 20.

The processing functions described above can be implemented by a computer. In that case, a program is provided that describes the processing content of the functions the processing device should have. Executing the program on a computer implements the processing functions on that computer. The program describing the processing content can be recorded on a computer-readable recording medium.

When the program is distributed, it is sold, for example, in the form of a portable recording medium such as a DVD (Digital Versatile Disc) or CD-ROM (Compact Disc Read Only Memory) on which the program is recorded. Alternatively, the program can be stored in a storage device of a server computer and transferred from the server computer to other computers over a network.

A computer that executes the program stores in its own storage device, for example, the program recorded on the portable recording medium or the program transferred from the server computer. The computer then reads the program from its own storage device and executes processing according to the program. The computer can also read the program directly from the portable recording medium and execute processing according to it. The computer can also execute processing according to the received program successively, each time the program is transferred from the server computer.

The embodiment described above is a preferred example of carrying out the present invention. However, the present invention is not limited to it, and various modifications can be made without departing from the gist of the present invention.

The following supplementary notes are further disclosed with respect to the above description.
(Supplementary Note 1) A processing program that causes a computer that aggregates a plurality of data items classified by a plurality of key types to execute a process comprising:
generating map data in which a key list, which is a list of the keys related to each data item, is associated with an aggregate key, which is one of those keys, and storing the map data in a storage unit;
referring to the storage unit to acquire the key lists of a group of map data associated with the same aggregate key, and comparing the largest number of keys contained in any of the acquired key lists with the total number of keys obtained by merging all of the acquired key lists; and
when the comparison shows that the total number of keys is larger, generating as many new map data items as there are keys in the merged key list, each having one of the keys in the merged key list as its aggregate key and the merged key list as its key list, and storing them in the storage unit.
(Supplementary Note 2) The processing program according to Supplementary Note 1, wherein the map data includes an event list that is a list of the contents of the data, and
in the process of generating and storing the new map data, when the comparison shows that the total number of keys is larger, the map data whose aggregate key is the lexicographically first key among the keys in the merged key list is identified, and data is associated only with the event list of the identified map data.
(Supplementary Note 3) The processing program according to Supplementary Note 1 or 2, wherein, when the comparison in the comparing process shows that the total number of keys is not larger and the acquired event lists contain data, new map data is generated that has the lexicographically first key of the merged key list as its aggregate key and is associated with an event list obtained by merging the acquired event lists and with the merged key list, and the new map data is stored in the storage unit.
(Supplementary Note 4) The processing program according to any one of Supplementary Notes 1 to 3, further causing the computer to execute a process of, when the comparison in the comparing process shows that the total number of keys is not larger for the key lists associated with every aggregate key stored in the storage unit, creating a unique value based on the key lists stored in the storage unit and generating map data having that value as its aggregate key.
(Supplementary Note 5) A processing method in which a computer that aggregates a plurality of data items classified by a plurality of key types executes:
a step of generating map data in which a key list, which is a list of the keys related to each data item, is associated with an aggregate key, which is one of those keys, and storing the map data in a storage unit;
a step of referring to the storage unit to acquire the key lists of a group of map data associated with the same aggregate key, and comparing the largest number of keys contained in any of the acquired key lists with the total number of keys obtained by merging all of the acquired key lists; and
a step of, when the comparison in the comparing step shows that the total number of keys is larger, generating as many new map data items as there are keys in the merged key list, each having one of the keys in the merged key list as its aggregate key and the merged key list as its key list, and storing them in the storage unit.
(Supplementary Note 6) The processing method according to Supplementary Note 5, wherein the map data includes an event list that is a list of the contents of the data, and
in the step of generating and storing the new map data, when the comparison shows that the total number of keys is larger, the map data whose aggregate key is the lexicographically first key among the keys in the merged key list is identified, and data is associated only with the event list of the identified map data.
(Supplementary Note 7) The processing method according to Supplementary Note 5 or 6, wherein the computer executes a step of, when the comparison in the comparing step shows that the total number of keys is not larger and the acquired event lists contain data, generating new map data that has the lexicographically first key of the merged key list as its aggregate key and is associated with an event list obtained by merging the acquired event lists and with the merged key list, and storing the new map data in the storage unit.
(Supplementary Note 8) The processing method according to any one of Supplementary Notes 5 to 7, wherein the computer executes a step of, when the comparison in the comparing process shows that the total number of keys is not larger for the key lists associated with every aggregate key stored in the storage unit, creating a unique value based on the key lists stored in the storage unit and generating map data having that value as its aggregate key.
(Supplementary Note 9) A processing device that aggregates a plurality of data items classified by a plurality of key types, comprising:
a storage unit that stores the plurality of data items and map data generated from the plurality of data items;
a first generation/storage unit that generates map data in which a key list, which is a list of the keys related to each data item, is associated with an aggregate key, which is one of those keys, and stores the map data in the storage unit;
a comparison unit that refers to the storage unit to acquire the key lists of a group of map data associated with the same aggregate key, and compares the largest number of keys contained in any of the acquired key lists with the total number of keys obtained by merging all of the acquired key lists; and
a second generation/storage unit that, when the comparison in the comparing step shows that the total number of keys is larger, generates as many new map data items as there are keys in the merged key list, each having one of the keys in the merged key list as its aggregate key and the merged key list as its key list, and stores them in the storage unit.
(Supplementary Note 10) The processing device according to Supplementary Note 9, wherein the map data includes an event list that is a list of the contents of the data, and
the second generation/storage unit, when the comparison by the comparison unit shows that the total number of keys is larger, identifies the map data whose aggregate key is the lexicographically first key among the keys in the merged key list, and associates data only with the event list of the identified map data.
(Supplementary Note 11) The processing device according to Supplementary Note 9 or 10, further comprising a third generation/storage unit that, when the comparison by the comparison unit shows that the total number of keys is not larger and the acquired event lists contain data, generates new map data that has the lexicographically first key of the merged key list as its aggregate key and is associated with an event list obtained by merging the acquired event lists and with the merged key list, and stores the new map data in the storage unit.
(Supplementary Note 12) The processing device according to any one of Supplementary Notes 9 to 11, further comprising a generation unit that, when the comparison by the comparison unit shows that the total number of keys is not larger for the key lists associated with every aggregate key stored in the storage unit, creates a unique value based on the key lists stored in the storage unit and generates map data having that value as its aggregate key.
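The comparison and regeneration described in Supplementary Notes 1 to 3 can be illustrated with the following sketch of one Reduce pass over a single group. It is a simplified, single-machine illustration under stated assumptions, not the claimed implementation: the `MapData` record, the function name `reduce_group`, and the behavior of emitting nothing when the total is not larger and no event data is present are hypothetical choices made for the example.

```python
from typing import NamedTuple


class MapData(NamedTuple):
    # Hypothetical record layout: aggregate key, key list, event list.
    aggregate_key: str
    key_list: tuple
    event_list: tuple


def reduce_group(group):
    """Process the map data sharing one aggregate key (a non-empty list)."""
    # Merge all key lists; sorting puts the lexicographically first key at index 0.
    merged_keys = tuple(sorted({k for m in group for k in m.key_list}))
    merged_events = tuple(e for m in group for e in m.event_list)
    max_input = max(len(set(m.key_list)) for m in group)
    total = len(merged_keys)

    if total > max_input:
        # Notes 1 and 2: one new record per merged key; event data is
        # attached only to the record keyed by the first key.
        first = merged_keys[0]
        return [
            MapData(k, merged_keys, merged_events if k == first else ())
            for k in merged_keys
        ]
    if merged_events:
        # Note 3: total not larger, event data present.
        return [MapData(merged_keys[0], merged_keys, merged_events)]
    return []  # assumption: nothing to carry forward
```

For example, a group holding key lists `("a", "b")` and `("a", "c")` merges to three keys, which exceeds the largest input list of two, so three new records are produced and only the one keyed by `"a"` carries the merged event list.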

10 Processing server (computer, processing device)
12 Map processing unit (first generation/storage unit)
14 Reduce processing unit (comparison unit, second generation/storage unit)
40 Distributed file system (storage unit)

Claims

A processing program that causes a computer that aggregates a plurality of data items classified by a plurality of key types to execute a process comprising:
generating map data in which a key list, which is a list of the keys related to each data item, is associated with an aggregate key, which is one of those keys, and storing the map data in a storage unit;
referring to the storage unit to acquire the key lists of a group of map data associated with the same aggregate key, and comparing the largest number of keys contained in any of the acquired key lists with the total number of keys obtained by merging all of the acquired key lists; and
when the comparison shows that the total number of keys is larger, generating as many new map data items as there are keys in the merged key list, each having one of the keys in the merged key list as its aggregate key and the merged key list as its key list, and storing them in the storage unit.

The processing program according to claim 1, wherein the map data includes an event list that is a list of the contents of the data, and
in the process of generating and storing the new map data, when the comparison shows that the total number of keys is larger, the map data whose aggregate key is the lexicographically first key among the keys in the merged key list is identified, and data is associated only with the event list of the identified map data.

The processing program according to claim 1 or 2, wherein, when the comparison in the comparing process shows that the total number of keys is not larger and the acquired event lists contain data, new map data is generated that has the lexicographically first key of the merged key list as its aggregate key and is associated with an event list obtained by merging the acquired event lists and with the merged key list, and the new map data is stored in the storage unit.

The processing program according to any one of claims 1 to 3, further causing the computer to execute a process of, when the comparison in the comparing process shows that the total number of keys is not larger for the key lists associated with every aggregate key stored in the storage unit, creating a unique value based on the key lists stored in the storage unit and generating map data having that value as its aggregate key.

A processing method in which a computer that aggregates a plurality of data items classified by a plurality of key types executes:
a step of generating map data in which a key list, which is a list of the keys related to each data item, is associated with an aggregate key, which is one of those keys, and storing the map data in a storage unit;
a step of referring to the storage unit to acquire the key lists of a group of map data associated with the same aggregate key, and comparing the largest number of keys contained in any of the acquired key lists with the total number of keys obtained by merging all of the acquired key lists; and
a step of, when the comparison in the comparing step shows that the total number of keys is larger, generating as many new map data items as there are keys in the merged key list, each having one of the keys in the merged key list as its aggregate key and the merged key list as its key list, and storing them in the storage unit.

A processing device that aggregates a plurality of data items classified by a plurality of key types, comprising:
a storage unit that stores the plurality of data items and map data generated from the plurality of data items;
a first generation/storage unit that generates map data in which a key list, which is a list of the keys related to each data item, is associated with an aggregate key, which is one of those keys, and stores the map data in the storage unit;
a comparison unit that refers to the storage unit to acquire the key lists of a group of map data associated with the same aggregate key, and compares the largest number of keys contained in any of the acquired key lists with the total number of keys obtained by merging all of the acquired key lists; and
a second generation/storage unit that, when the comparison in the comparing step shows that the total number of keys is larger, generates as many new map data items as there are keys in the merged key list, each having one of the keys in the merged key list as its aggregate key and the merged key list as its key list, and stores them in the storage unit.