JP2012160013A - Data analysis and machine learning processing unit, method, and program

Info

Publication number: JP2012160013A
Application number: JP2011019172A
Authority: JP
Inventors: Keishi Fukumoto (福本 佳史); Makoto Onizuka (鬼塚 真)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-01-31
Filing date: 2011-01-31
Publication date: 2012-08-23
Anticipated expiration: 2031-01-31
Also published as: JP5552449B2

Abstract

PROBLEM TO BE SOLVED: To reuse, in MapReduce, intermediate data produced by past processing. SOLUTION: While processing input data, the Mapper and the Reducer detect from the data the parameter value, the use position, and the calling function (the function in which the parameter is used) as information on the parameters used by the mapping means and the reducing means, and store this information in a distributed file storage means. The parameters given to a job are collated with the parameter information stored in the distributed file storage means, and, if a prescribed condition is satisfied, the intermediate data is stored as a cache in the distributed file storage means or a local storage means. When the parameters given to a job indicate that the intermediate data is already stored in the distributed file storage means or the local storage means, only the Reducer is executed.

Description

The present invention relates to a data analysis and machine learning processing apparatus, method, and program, and more particularly to an apparatus, method, and program that improve the efficiency of machine learning processing for the analysis of large-scale data.

MapReduce is one technique for analyzing large-scale data (see, for example, Non-Patent Document 1).

FIG. 23 shows the flow of MapReduce processing. MapReduce uses multiple computers interconnected by a network and is a distributed processing framework in which processing proceeds as shown in FIG. 23.

When MapReduce processing starts, a user-defined "Mapper" is first started on each computer. Each computer reads the distributed data stored in advance in the distributed file system (HDFS) and performs Map processing.

In Map processing, each computer applies the user-defined Map function once per line, in order from the top, to the distributed data assigned to it. The Map function outputs a set of records in key-value format as intermediate data. The types of the key and the value are defined by the user within certain constraints.

Next, a user-defined Reducer is started on each computer, and the intermediate data is moved between computers over the network so that records with the same key are collected on a single computer. This is called "Shuffle processing".

The intermediate data records that share a key end up sorted, and their values are passed as an iterator to the user-defined Reduce function. The Reduce function outputs a set of records in key-value format as its result.
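The Map, Shuffle, and Reduce stages described above can be illustrated with a minimal single-process sketch. It is not Hadoop code: the word-count task and the names `map_fn`, `reduce_fn`, `shuffle`, and `run_mapreduce` are assumptions chosen only for illustration.

```python
from collections import defaultdict

def map_fn(line):
    # Hypothetical user-defined Map function: one (key, value) record per word.
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    # Hypothetical user-defined Reduce function: receives a key and an
    # iterator over all values that share that key.
    yield key, sum(values)

def shuffle(intermediate):
    # Gather records with the same key together and sort the groups by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return sorted(groups.items())

def run_mapreduce(lines, map_fn, reduce_fn):
    # Map: apply map_fn once per input line, in order from the top.
    intermediate = [record for line in lines for record in map_fn(line)]
    # Shuffle, then Reduce: one reduce_fn call per distinct key.
    output = []
    for key, values in shuffle(intermediate):
        output.extend(reduce_fn(key, iter(values)))
    return output

result = run_mapreduce(["a b a", "b c"], map_fn, reduce_fn)
print(result)
```

In a real cluster the intermediate records are spread over many computers and the shuffle moves them over the network; here everything stays in one list so that the data flow is easy to follow.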

FIG. 24 shows the processing flow of a typical application that uses MapReduce.

Step 100) An initial configuration file is read and a job is generated to start MapReduce.

Step 110) The generated job is given the Mapper and Reducer to use, the number of Reducers, other settings required by the MapReduce processing itself, parameters used inside the user-defined Mapper, Reducer, and so on, and parameters that can be changed at run time, for example by parsing command-line arguments.

Step 120) MapReduce processing is performed. Details are given with reference to FIG. 25.

Step 130) Some user-defined processing is performed on the result output by the MapReduce processing.

Next, the MapReduce processing of step 120 in FIG. 24 is described. MapReduce is a distributed processing framework that runs on a cluster of nodes (computers) interconnected by a network.

FIG. 25 shows the flow of typical MapReduce processing.

Step 200) When MapReduce processing starts, a user-defined Mapper is first started on each node. Each node reads the distributed data stored in advance in the distributed file system and performs the user-defined Map processing: the user-defined Map function is applied once per line, in order from the top, to the distributed data assigned to the node, and arbitrary intermediate data in key-value format is output.

Step 210) Only when the user has explicitly specified a class to use for Combine, a Combiner is started on each node as soon as Map processing finishes. Combine processing (a local Reduce) merges the intermediate data output by each Map process that shares a key, and outputs intermediate data in the form of a key and a list of values.

Step 220) A user-defined Reducer is started on each node, and Shuffle processing moves the intermediate data between nodes over the network so that records with the same key are collected on a single node. During Shuffle, the intermediate data is sorted by key, and a list of values is output for each key.

Step 230) Reduce processing is performed on each piece of shuffled intermediate data.
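As a rough illustration of why the Combine step of step 210 matters, the sketch below counts how many intermediate records would have to cross the network with and without a local Reduce. The two-node layout, the sample records, and the function name `combine` are assumptions made for this example only.

```python
from collections import defaultdict

def combine(map_output):
    # Local Reduce on one node: merge the values of records that share a key,
    # so that at most one record per key leaves the node.
    totals = defaultdict(int)
    for key, value in map_output:
        totals[key] += value
    return sorted(totals.items())

# Hypothetical Map output on two nodes (word-count style records).
node_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("a", 1), ("c", 1)],
]

# Records that would be shuffled if no Combiner were specified.
without_combiner = sum(len(output) for output in node_outputs)

# Records that are shuffled after each node has combined its own output.
combined = [combine(output) for output in node_outputs]
with_combiner = sum(len(output) for output in combined)

print(without_combiner, with_combiner)
```

This only works as shown because summation is associative and commutative; a Combiner for a different Reduce function would need the same property.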

A related technique runs multiple MapReduce jobs (Grep processing) over the same input data, attaching tags to the keys of the intermediate data and reducing the amount of intermediate data by integrating jobs in advance and executing multiple jobs simultaneously (see, for example, Non-Patent Document 2).

Tomasz Nykiel, Michalis Potamias, Chaitanya Mishra, George Kollios, Nick Koudas, "MRShare: Sharing Across Multiple Queries in MapReduce", VLDB 2010, September 2010
Jeffrey Dean, Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004, http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/ja//papers/mapreduce-osdi04.pdf

Because the accuracy of a machine learning result can vary greatly with the setting values given to the algorithm in advance, the best result cannot be obtained without repeating the processing several times while adjusting those values. In particular, when MapReduce is used to process large-scale data, the machine learning library "Mahout", for example, gives little consideration to the tuning of setting values, so long-running processing must be repeated many times to obtain a good result, which is very inefficient.

The technique of Non-Patent Document 1 can reduce Shuffle cost across multiple runs by sharing the data-reading portion of each run and thereby reducing intermediate data, but it cannot achieve a large reduction in processing by reusing the intermediate data of earlier runs.

The present invention has been made in view of the above. Its object is to provide a data analysis and machine learning processing apparatus, method, and program that solve the problem that intermediate data used in past processing cannot be reused and that can greatly reduce the amount of processing.

To solve the above problem, the present invention (claim 1) is a data analysis and machine learning apparatus for parallel distributed processing of large-scale data, having map means for generating one or more key-value pairs from one key-value pair, reduce means for generating one or more key-value pairs from a plurality of key-value pairs having the same key, and distributed file storage means for storing given teacher data, wherein
the map means and the reduce means have
detection means for detecting from the data, during processing of input data, the parameter value, the use position, and the calling function as information on the parameters used by the map means and the reduce means, and storing that information in the distributed file storage means;
the apparatus has
intermediate data storage means for referring to the parameters given to a job, collating them with the parameter information stored in the distributed file storage means, and, when a predetermined condition is satisfied, storing the intermediate data as a cache in the distributed file storage means or in local storage means, and
skip processing means for referring to the parameters given to a job and, when the intermediate data is stored in the distributed file storage means or the local storage means, causing only the reduce means to be executed;
and the reduce means includes
means for reading the intermediate data from the distributed file storage means or the local storage means and processing it when the intermediate data is stored there.

In the present invention (claim 2), the detection means of claim 1 includes means for detecting the class names of the map means and the reduce means, the order in which the parameter names and values given in advance are used during processing by the map means and the reduce means, and the functions in which they are used.

In the present invention (claim 3), the intermediate data storage means of claim 1 includes means for storing the intermediate data as a cache when the processing of the map means depends on a parameter, when the frequency of use of the parameter exceeds a predetermined threshold, or when the user has specified that the data be saved.

The present invention (claim 4) is a data analysis and machine learning method in an apparatus for parallel distributed processing of large-scale data, the apparatus having map means for generating one or more key-value pairs from one key-value pair, reduce means for generating one or more key-value pairs from a plurality of key-value pairs having the same key, and distributed file storage means for storing given teacher data, the method performing:
a detection step in which the map means and the reduce means, during processing of input data, detect from the data the parameter value, the use position, and the calling function as information on the parameters used by the map means and the reduce means, and store that information in the distributed file storage means;
an intermediate data storage step in which intermediate data storage means refers to the parameters given to a job, collates them with the parameter information stored in the distributed file storage means, and, when a predetermined condition is satisfied, stores the intermediate data as a cache in the distributed file storage means or in local storage means;
a skip processing step in which skip processing means refers to the parameters given to a job and, when the intermediate data is stored in the distributed file storage means or the local storage means, causes only the reduce means to be executed; and
a step in which the reduce means, when the intermediate data is stored in the distributed file storage means or the local storage means, reads the intermediate data from there and processes it.

In the present invention (claim 5), the detection step of claim 4 detects the class names of the map means and the reduce means, the order in which the parameter names and values given in advance are used during processing by the map means and the reduce means, and the functions in which they are used.

In the present invention (claim 6), the intermediate data storage step of claim 4 stores the intermediate data as a cache when the processing of the map means depends on a parameter, when the frequency of use of the parameter exceeds a predetermined threshold, or when the user has specified that the data be saved.

The present invention (claim 7) is a data analysis and machine learning program for causing a computer to function as each of the means constituting the data analysis and machine learning apparatus according to any one of claims 1 to 3.

As described above, the present invention detects, during MapReduce processing, the order in which parameters are used and the functions that use them, uses that information to decide whether to keep the intermediate data as a cache, and, when a predetermined condition is satisfied, stores the intermediate data in the distributed file system. By reusing the cached intermediate data in subsequent processing, Map processing and Shuffle processing are skipped and only Reduce processing is started, which reduces the amount of processing.

FIG. 1 is a system configuration diagram of the first embodiment of the present invention.
FIG. 2 is a flowchart of the overall operation in the first embodiment of the present invention.
FIG. 3 is a flowchart of processing using MapReduce that stores intermediate data in the first embodiment of the present invention (S310).
FIG. 4 is a flowchart of the intermediate data storage determination in the first embodiment of the present invention (S620).
FIG. 5 is a flowchart of processing using MapReduce that reuses intermediate data in the first embodiment of the present invention (S320).
FIG. 6 is a flowchart of the intermediate data reuse determination in the first embodiment of the present invention (S920).
FIG. 7 is a flowchart of processing using MapReduce that detects parameter information in the second embodiment of the present invention (S300).
FIG. 8 is a flowchart of MapReduce processing that detects parameter information in the second embodiment of the present invention (S420).
FIG. 9 is an image of parameter information detection in the second embodiment of the present invention.
FIG. 10 is a flowchart of MapReduce processing that stores intermediate data in the third embodiment of the present invention (S630).
FIG. 11 is a flowchart of MapReduce processing that reuses intermediate data in the third embodiment of the present invention (S930).
FIG. 12 is a diagram showing the behavior of each node at the start of MapReduce processing by Hadoop in one example of the present invention.
FIG. 13 is a flowchart of the general processing of a client node.
FIG. 14 is a flowchart of the processing when a job is added (S1450).
FIG. 15 is a flowchart of the processing when a job is added in one example of the present invention (S1450).
FIG. 16 is a flowchart of the processing of the job scheduler in one example of the present invention.
FIG. 17 is a flowchart of the task acquisition loop of the TaskTracker in one example of the present invention.
FIG. 18 is a flowchart of the processing at task start in one example of the present invention (S1821, S1831).
FIG. 19 is a flowchart of the processing of the parameter-detection Mapper in one example of the present invention (S1920).
FIG. 20 is a flowchart of the processing when a parameter is called in the detection Configuration class in one example of the present invention.
FIG. 21 is a flowchart of the processing of the detection Reducer in one example of the present invention (S1920).
FIG. 22 is a flowchart of the processing of the skip Mapper in one example of the present invention (S1920).
FIG. 23 shows the flow of MapReduce processing.
FIG. 24 is a flowchart of general application processing using MapReduce.
FIG. 25 is a flowchart of general MapReduce processing.

Embodiments of the present invention are described below with reference to the drawings.

[First embodiment]
FIG. 1 shows the system configuration in the first embodiment of the present invention.

As in FIG. 23, the present invention is configured as a system that uses a plurality of computers interconnected by a network and runs a distributed processing framework whose processing follows the same flow as in FIG. 23.

In this embodiment, during MapReduce processing, the aggregation data creation unit (Mapper) 20 and the aggregation unit (Reducer) 40 detect and store information about parameters. Thereafter, at the start of MapReduce processing, the control unit (Controller) 50 compares the Mapper, the Reducer, and the parameters to be used with the stored information and determines, based on user specification and the like, whether to store the intermediate data as a cache in the distributed file system 10. In subsequent processing, it determines whether to use the cache stored in the distributed file system 10 and, by reusing the intermediate data, skips Map and Shuffle processing.

The operation is described in detail below.

FIG. 2 is a flowchart of the overall operation in the first embodiment of the present invention.

Step 300) MapReduce processing like that of FIG. 24 is performed. In the course of it, the values, use positions, and calling functions of the parameters used by the Mapper 20 and the Reducer 40 are detected and stored in the distributed file system 10.

Step 310) In the course of the MapReduce processing shown in FIG. 24, the control unit (Controller) 50 refers to the parameters given to the job and collates them with the parameter information obtained in step 300 to determine whether the intermediate data should be stored as a cache. If so, the intermediate data is stored in the distributed file system 10 or locally.

Step 320) In the course of the MapReduce processing shown in FIG. 24, the Controller 50 refers to the parameters given to the job and determines whether the intermediate data cache obtained in step 310 can be reused. If it can, the cache is reused so that Map processing and Shuffle processing are skipped and only the Reduce processing of the Reducer 40 is performed.
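The effect of steps 310 and 320 can be sketched in a few lines, under simplifying assumptions: a Python dictionary stands in for the distributed file system, the cache is keyed on the input identifier plus the parameters the Mapper reads, and all names (`cache`, `run_job`, `min_count`, `map_param_names`) are hypothetical. The save decision of step 310 is reduced here to "always save".

```python
from collections import defaultdict

cache = {}      # stand-in for the distributed file system or local storage
map_calls = 0   # counts Map invocations so that the skip is observable

def map_fn(line, params):
    # Hypothetical Mapper that reads no parameters.
    global map_calls
    map_calls += 1
    for word in line.split():
        yield word, 1

def reduce_fn(key, values, params):
    # Hypothetical Reducer whose result depends on the tunable "min_count".
    total = sum(values)
    if total >= params["min_count"]:
        yield key, total

def run_job(input_id, lines, params, map_param_names=()):
    # Cache key: the input plus the values of the parameters the Mapper reads.
    key = (input_id, tuple((n, params[n]) for n in sorted(map_param_names)))
    if key in cache:
        grouped = cache[key]            # reuse: Map and Shuffle are skipped
    else:
        groups = defaultdict(list)
        for line in lines:
            for k, v in map_fn(line, params):
                groups[k].append(v)
        grouped = sorted(groups.items())
        cache[key] = grouped            # save just before Reduce
    output = []
    for k, values in grouped:
        output.extend(reduce_fn(k, iter(values), params))
    return output

lines = ["a b a", "b c"]
first = run_job("docs", lines, {"min_count": 1})
second = run_job("docs", lines, {"min_count": 2})  # only Reduce runs here
print(first, second, map_calls)
```

The second job changes only a Reducer-side setting, so the cached intermediate data is still valid and the Map counter does not increase. This is the parameter-tuning situation the description targets.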

Next, the operation of step 310 is described in detail.

FIG. 3 is a flowchart of processing using MapReduce that stores intermediate data in the first embodiment of the present invention.

Step 600) A MapReduce job is generated as in FIG. 24.

Step 610) Parameters are given to the job as in FIG. 24.

Step 620) Based on the parameters held by the job, which are the output of step 610, the Controller 50 refers to the parameter information stored in the distributed file system 10, which is the output of step 300, and determines whether to store the intermediate data as a cache for this job. If the data is to be stored, a new parameter indicating a save flag of "true" is added.

Step 630) During MapReduce processing, immediately before Reduce processing, the intermediate data is stored as a cache in the distributed file system 10.

Step 640) The same processing as step 130 in FIG. 24 is performed.

Next, the processing of step 620 is described in detail.

FIG. 4 is a flowchart of the intermediate data storage determination in the first embodiment of the present invention.

Step 700) Given the Mapper class name and the parameter information, the Controller 50 looks up the parameter information by Mapper class name and outputs "true" if no parameter is called within the processing of the Mapper 20, and "false" otherwise.

Step 710) The Controller 50 determines whether, for every parameter used in Map processing, the frequency with which the same value is set exceeds the threshold specified by the user. Specifically, it looks up the given parameter information by Mapper class name and outputs "true" if every parameter used within the processing of the Mapper 20 is set to the same value with a frequency exceeding the user-specified threshold, and "false" otherwise.

Step 720) The Controller 50 obtains the set of parameters and checks whether it contains one that explicitly enables saving to the intermediate data storage unit 30, outputting "true" if so and "false" if not.

Step 730) If any of steps 700, 710, and 720 outputs "true", the Controller 50 adds to the parameters held by the MapReduce job a flag that enables saving of the intermediate data and the location where the intermediate data is to be saved.
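The three checks of steps 700 to 720 and the flagging of step 730 might be sketched as follows. The data layout is an assumption: `param_info` maps a Mapper class name to the parameter names it was observed to read, `history` lists the values given to each parameter in past jobs, and "frequency" is interpreted as the fraction of past jobs that used the current value. The parameter keys `cache.save`, `cache.save.requested`, and `cache.location` are hypothetical names, not Hadoop settings.

```python
def no_parameter_calls(mapper_class, param_info):
    # Step 700: true when the Mapper was never observed reading a parameter.
    return not param_info.get(mapper_class)

def values_exceed_threshold(mapper_class, param_info, history, job_params, threshold):
    # Step 710: true when every parameter read by the Mapper was set to its
    # current value in more than `threshold` of the recorded past jobs.
    names = param_info.get(mapper_class, [])
    if not names:
        return False
    for name in names:
        past = history.get(name, [])
        if not past:
            return False
        frequency = past.count(job_params.get(name)) / len(past)
        if frequency <= threshold:
            return False
    return True

def save_explicitly_enabled(job_params):
    # Step 720: true when the user explicitly asked for the data to be saved.
    return job_params.get("cache.save.requested") is True

def decide_save(mapper_class, param_info, history, job_params, threshold, location):
    # Step 730: if any check is true, add the save flag and the save location.
    updated = dict(job_params)
    if (no_parameter_calls(mapper_class, param_info)
            or values_exceed_threshold(mapper_class, param_info, history,
                                       job_params, threshold)
            or save_explicitly_enabled(job_params)):
        updated["cache.save"] = True
        updated["cache.location"] = location
    return updated

param_info = {"KMeansMapper": ["k"]}
history = {"k": [8, 8, 8, 4]}
saved = decide_save("KMeansMapper", param_info, history, {"k": 8}, 0.5, "/cache/job1")
skipped = decide_save("KMeansMapper", param_info, history, {"k": 4}, 0.5, "/cache/job2")
print(saved, skipped)
```

With `k = 8` used in three of four past jobs, the frequency 0.75 exceeds the threshold 0.5 and the job is flagged; with `k = 4` it is not.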

Next, the processing of step 320 in FIG. 2 is described in detail.

FIG. 5 is a flowchart of processing using MapReduce that reuses intermediate data in the first embodiment of the present invention.

Step 900) The Controller 50 generates a MapReduce job as in step 100 of FIG. 24.

Step 910) Parameters are given to the generated job as in step 110 of FIG. 24.

Step 920) Based on the parameters held by the job, which are the output of step 910, it is determined whether intermediate data usable by the job is stored as a cache in the distributed file system 10. If it is, a new parameter whose cache use flag indicates "true" is added to the job.

Step 930) The Reducer 40 uses the intermediate data stored as a cache in the intermediate data storage unit 30, so that Map processing and Shuffle processing are skipped and MapReduce processing consisting of Reduce processing only is performed.

Step 940) The same processing as step 130 in FIG. 24 is performed.

Next, the processing of step 920 in FIG. 5 is described.

FIG. 6 is a flowchart of the intermediate data reuse determination in the first embodiment of the present invention.

Step 1000) Based on the parameters held by the job, it is determined whether the distributed file system 10 contains a usable cache, that is, intermediate data produced by processing the same input data with the same parameters.

Step 1010) If it does, a cache use flag that includes the storage location of the cache to be used is attached to the job.
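A minimal sketch of steps 1000 and 1010 follows. It assumes an index that maps (input path, relevant parameter values) to a cache location; the text does not specify how the index is stored, and restricting the comparison to the parameters the Mapper reads is a further assumption. The keys `cache.use` and `cache.use.location` and the paths are hypothetical.

```python
def cache_key(input_path, job_params, mapper_param_names):
    # Identify a cache by its input data and by the values of the parameters
    # that influence the intermediate data (assumed: those read by the Mapper).
    relevant = tuple((name, job_params[name]) for name in sorted(mapper_param_names))
    return (input_path, relevant)

def mark_reuse(job_params, input_path, mapper_param_names, cache_index):
    # Step 1000: look for a cache made from the same input and parameters.
    location = cache_index.get(cache_key(input_path, job_params, mapper_param_names))
    updated = dict(job_params)
    if location is not None:
        # Step 1010: attach the use flag together with the cache location.
        updated["cache.use"] = True
        updated["cache.use.location"] = location
    return updated

# Hypothetical index with one cache produced by an earlier job with k = 8.
cache_index = {cache_key("/data/train", {"k": 8}, ["k"]): "/cache/job1"}

hit = mark_reuse({"k": 8, "max_iter": 20}, "/data/train", ["k"], cache_index)
miss = mark_reuse({"k": 4, "max_iter": 20}, "/data/train", ["k"], cache_index)
print(hit, miss)
```

A job flagged this way would then start at Reduce processing, reading its input from the recorded location instead of running Map and Shuffle.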

[Second embodiment]
In this embodiment, the processing of step 300 in FIG. 2 of the first embodiment detects the names of the Mapper and Reducer classes used in MapReduce processing, the parameter names and values given in advance, and the order in which and the functions within which they are used during MapReduce processing. This information is stored in the distributed file system 10 so that it can be used in subsequent processing.

The system configuration of this embodiment is the same as that of the first embodiment.

FIG. 7 is a flowchart of processing using MapReduce that detects parameter information in the second embodiment of the present invention.

The following processing corresponds to step 300 in FIG. 2 of the first embodiment.

Step 400) Same as step 100 in FIG. 24.

Step 410) Same as step 110 in FIG. 24.

Step 420) The Mapper 20 and the Reducer 40 perform MapReduce processing while detecting, from arbitrary input data, the values, use positions, and calling functions of the parameters used internally and storing them in the distributed file system 10.

Step 430) Same as step 130 in FIG. 24.

Next, the processing of step 420 above is described in detail.

FIG. 8 is a flowchart of MapReduce processing that detects parameter information according to the second embodiment of the present invention.

Step 500) When the data assigned to each node is input, the Mapper 20 performs an arbitrary user-defined Map process. Each node applies the user-defined Map function once per row to the distributed data assigned to it, in order from the top. Whenever a parameter is used during this process, the Mapper class name, function, parameter name, and parameter value are stored in the distributed file system 10.

Step 510) When the Key-Value data generated in step 500 is input, entries with a common key are collected (so that each key has a list of values), and an arbitrary user-defined Combine process is performed. Whenever a parameter is used during this process, the Combiner class name, function, parameter name, and parameter value are stored in the distributed file system 10.

Step 520) The Reducer 40 performs the same processing as step 220 in FIG. 25.

Step 530) The Reducer 40 performs an arbitrary user-defined Reduce process on its input. Whenever a parameter is used during this process, the Reducer class name, function, parameter name, and parameter value are stored in the distributed file system 10.
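The flow of steps 500 to 530 can be sketched as a small in-memory pipeline. This is an illustration only: the word-count task, the parameter names `min_len` and `scale`, the class names `WordMapper` and `SumReducer`, the helper `use_param`, and the list `usage_log` (standing in for records written to the distributed file system 10) are all assumptions introduced here.

```python
from collections import defaultdict

usage_log = []  # stands in for records stored in the distributed file system


def use_param(params, name, owner, func):
    """Return a parameter value and record who used it (hypothetical helper)."""
    usage_log.append((owner, func, name, params[name]))
    return params[name]


def map_fn(line, params):
    # Step 500: user-defined Map, applied once per input row.
    min_len = use_param(params, "min_len", "WordMapper", "map")
    return [(word, 1) for word in line.split() if len(word) >= min_len]


def combine_fn(pairs):
    # Step 510: collect entries with a common key within one node.
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return list(grouped.items())


def shuffle(node_outputs):
    # Step 520: bring all values for one key together across nodes.
    merged = defaultdict(list)
    for pairs in node_outputs:
        for key, value in pairs:
            merged[key].append(value)
    return merged


def reduce_fn(key, values, params):
    # Step 530: user-defined Reduce on the grouped values.
    scale = use_param(params, "scale", "SumReducer", "reduce")
    return key, scale * sum(values)


def run(splits, params):
    node_outputs = [
        combine_fn([pair for line in split for pair in map_fn(line, params)])
        for split in splits
    ]
    return dict(reduce_fn(k, vs, params) for k, vs in shuffle(node_outputs).items())


result = run([["a bb a"], ["bb ccc"]], {"min_len": 2, "scale": 10})
```

After the run, `usage_log` holds one record per parameter lookup, which is the kind of information the second embodiment collects for later cache decisions.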

FIG. 9 shows an outline of the processing as implemented. MapReduce can be readily implemented by using the open-source software "Hadoop". In Hadoop, all parameters given to a job in advance are managed by a class named "Configuration". During MapReduce processing, the Mapper and the Reducer receive from the Controller (the control unit 50, or the JobTracker of FIG. 12 described later) the location where this Configuration has been saved as a file, and read that file to use it. The Configuration stores all parameters, including not only those used by the Mapper and the Reducer (for example, parameters of a machine learning algorithm) but also those used by the JobTracker. The Configuration manages parameters as name/value pairs and is used by querying a name and receiving the corresponding value. In the present invention, as shown in the lower part of the figure, parameter information is detected by replacing this Configuration with a detection version. Specifically, when a name is sent to the Configuration in order to use a parameter, the detection version not only returns the value but also acquires information such as the class name and function name of the caller and the parameter value, and saves it separately to HDFS.
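The idea of swapping in a detection version of the Configuration can be illustrated with a Python analog. This sketch is not Hadoop's `Configuration` API: it uses Python's `inspect.stack()` to find the caller, and the names `DetectingConfiguration`, `ThresholdMapper`, and `threshold` are hypothetical. The `records` list stands in for the information saved to HDFS.

```python
import inspect


class DetectingConfiguration:
    """Name/value store that also records the caller of every lookup."""

    def __init__(self, values):
        self._values = dict(values)
        self.records = []  # stands in for files saved to HDFS

    def get(self, name):
        caller = inspect.stack()[1]  # frame of the function that called get()
        owner = caller.frame.f_locals.get("self")
        class_name = type(owner).__name__ if owner is not None else None
        value = self._values[name]
        self.records.append((class_name, caller.function, name, value))
        return value  # the caller still receives the plain value


class ThresholdMapper:
    def map(self, conf, x):
        # The Mapper is unaware that the lookup is being recorded.
        return x if x >= conf.get("threshold") else None


conf = DetectingConfiguration({"threshold": 5})
kept = [ThresholdMapper().map(conf, x) for x in (3, 7)]
```

Because the detection object exposes the same lookup interface as the store it replaces, user-defined Mapper and Reducer code does not need to change for its parameter usage to be observed.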

[Third Embodiment]
This embodiment uses the information obtained in the second embodiment and stores the intermediate data resulting from the Map processing in the distributed file system 10 as a cache in any of the following cases: the Map processing does not depend on the parameters, the same parameters are statistically used frequently, or the user specifies caching. In subsequent processing, if the stored intermediate data can be used as a cache, it is read and only the Reduce processing is performed; the Map processing and the Shuffle processing are not performed.

FIG. 10 is a flowchart of MapReduce processing that stores intermediate data according to the third embodiment of the present invention. The processing shown in the figure corresponds to the processing of step 630 in FIG. 3 of the first embodiment.

Step 800) Map processing is performed in the same manner as step 200 in FIG. 25.

Step 810) Combine processing is performed in the same manner as step 210 in FIG. 25.

Step 820) Shuffle processing is performed in the same manner as step 220 in FIG. 25.

Step 830) The intermediate data produced by the above processing is output to the distributed file system 10 as a cache.

Step 840) The same processing as step 230 in FIG. 25 is performed. However, the shuffled intermediate data stored in the intermediate data storage unit 30 is fed directly into the Reducer 40, so that the Map processing and the Shuffle processing are skipped.

Next, the processing of step 930 in FIG. 5 is described.

FIG. 11 is a flowchart of MapReduce processing that reuses intermediate data according to the third embodiment of the present invention.

Step 1100) The Reducer 40 acquires the intermediate data stored as a cache in the distributed file system 10.

Step 1110) The Reducer 40 performs the same processing as step 230 in FIG. 25 using the acquired intermediate data.
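The effect of caching shuffled intermediate data and then running only the Reduce processing can be sketched as follows. This is a simplified, single-process illustration: the `cache` dictionary stands in for the distributed file system 10, the cache is keyed only by a hypothetical input name, and the counter in `stats` exists only to show that the Map processing is not repeated.

```python
from collections import defaultdict

cache = {}                 # stands in for the distributed file system 10
stats = {"map_calls": 0}   # counts how many rows went through Map


def map_and_shuffle(lines):
    """Map each row to (word, 1) pairs and group the values by key."""
    grouped = defaultdict(list)
    for line in lines:
        stats["map_calls"] += 1
        for word in line.split():
            grouped[word].append(1)
    return dict(grouped)


def run_job(input_name, lines, reduce_fn):
    if input_name in cache:
        # Steps 1100 and 1110: read the cached intermediate data, Reduce only.
        intermediate = cache[input_name]
    else:
        # Steps 800 to 830: Map and Shuffle, then store the result as a cache.
        intermediate = map_and_shuffle(lines)
        cache[input_name] = intermediate
    return {key: reduce_fn(values) for key, values in intermediate.items()}


lines = ["a b", "a"]
counts = run_job("docs", lines, sum)               # first run fills the cache
presence = run_job("docs", lines, lambda vs: 1)    # second run skips Map/Shuffle
```

The second job applies a different Reduce function to the same intermediate data, which is the situation in which skipping the Map and Shuffle processing saves work.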

The embodiments of the present invention have been described above. In the present invention, unless the user explicitly specifies caching of the intermediate data (that is, when caching is performed automatically), MapReduce is executed at least three times. The first run detects the parameter information. The second run uses the information from the first run to decide whether to cache the intermediate data and, if so, caches it. The third run uses the intermediate data cached in the second run to reduce the amount of processing.

An example in which the present invention is applied to the open-source distributed system "Hadoop" is described below.

In the following, a "parameter" in Hadoop refers specifically to the Configuration class. The Configuration class stores and manages, without distinction, every kind of parameter that appears in the flow of MapReduce processing by Hadoop: parameters of Hadoop itself, parameters that specify the user-defined Mapper and Reducer, and parameters such as variables used by the Mapper, the Reducer, and so on. The various classes involved in MapReduce use it either by inheriting from the Configuration class or by holding a Configuration instance. Parameters are stored in the Configuration class in "name: value" form; when a parameter name is passed to the method corresponding to the required type, the value is returned as that type. The class also has a method that converts its contents to XML format for output to HDFS or a local file.
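The behavior described above (name/value storage, typed lookup methods, and conversion to XML) can be illustrated with a small stand-in class. This is a Python sketch, not Hadoop's Java `Configuration` class: the class name `SimpleConfiguration`, its method names, and the parameter names `kmeans.k` and `cache.enabled` are assumptions. The XML layout with `configuration`, `property`, `name`, and `value` elements follows the form used by Hadoop configuration files.

```python
import xml.etree.ElementTree as ET


class SimpleConfiguration:
    """Minimal name/value store with typed getters and XML output."""

    def __init__(self):
        self._props = {}

    def set(self, name, value):
        self._props[name] = str(value)  # every value is stored as a string

    def get(self, name, default=None):
        return self._props.get(name, default)

    def get_int(self, name, default=0):
        raw = self._props.get(name)
        return int(raw) if raw is not None else default

    def get_boolean(self, name, default=False):
        raw = self._props.get(name)
        return raw.lower() == "true" if raw is not None else default

    def to_xml(self):
        root = ET.Element("configuration")
        for name, value in sorted(self._props.items()):
            prop = ET.SubElement(root, "property")
            ET.SubElement(prop, "name").text = name
            ET.SubElement(prop, "value").text = value
        return ET.tostring(root, encoding="unicode")


conf = SimpleConfiguration()
conf.set("kmeans.k", 3)
conf.set("cache.enabled", True)
k = conf.get_int("kmeans.k")
enabled = conf.get_boolean("cache.enabled")
xml_text = conf.to_xml()
```

Storing everything as strings and converting on read is what allows one store to hold framework settings, class names, and algorithm variables without distinction.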

A "job (Job or JobConf)" refers, in Hadoop, specifically to the Job class or the JobConf class. To start MapReduce processing with Hadoop, a job is created, the Mapper class, Reducer class, and so on to be used are supplied through the methods the job provides, and the MapReduce processing is started with the start method. The job contains a Configuration instance, and essentially all of the supplied parameters are stored there.

Note that the "JobTracker" described below corresponds to the "control unit (Controller) 50" in the embodiments above.

FIG. 12 shows the operation of each node at the start of MapReduce processing by Hadoop.

The system shown in the figure comprises a client node 1300 having a MapReduce application 1301 and a JobClient 1302, a JobTracker node 1310 having a JobTracker 1311, and a TaskTracker node 1320 having a TaskTracker 1321 and a Map or Reduce task 1322.

First, the operation of a typical client node 1300 is described.

FIG. 13 is a flowchart of the general processing of the client node.

Step 1400) When the JobClient 1302 of the client node 1300 creates an instance of the Job class included in Hadoop, the initial parameters in XML format are read and given to the job as a Configuration.

Step 1410) Parameters required by Hadoop, such as the Mapper, Reducer, Key, and Value classes, and parameters such as the initial values of variables used during processing are given to the job (that is, to the Configuration held by the job), and a job instance with these fixed parameters is output.

Step 1420) The location of the input data and the parameters that the user can change at run time are obtained, for example by parsing command-line arguments, and given to the job (that is, to the Configuration held by the job).

Step 1430) Using the JobClient 1302, the ID of a new job is obtained from the server on which the JobTracker is running and given to the job.

Step 1440) The job created and configured in steps 1400, 1410, and 1420 is output as XML to a predetermined location (path) on the distributed file system 1330 derived from the job ID obtained in step 1430 (a file indicating the input data is also output to a predetermined location).

Step 1450) Using the JobClient 1302, the JobTracker 1311 is notified that a new job has been added. Details are described later.

Step 1460) The job added at the JobTracker 1311 becomes a running job and is placed in the queue. MapReduce processing is executed from the queue as appropriate, and the client waits for completion while displaying the processing log on standard output (that is, it waits until all Map or Reduce tasks 1322 on the TaskTracker node 1320 have finished).

Step 1460) When the processing of step 1460 has finished, the log information of the MapReduce processing is obtained as its result, and arbitrary user-defined processing is performed.

FIG. 14 is a flowchart of the processing performed when a job is added and shows the details of step 1450 above.

Step 1500) The job is initialized based on the input job ID, and an execution job (JobInProgress) is created.

Step 1510) The execution job is added to the queue.

The processing described above is the general processing. In the present invention, the processing shown in FIG. 15 is performed instead of the processing of FIG. 14.

FIG. 15 is a flowchart of the processing performed when a job is added according to an example of the present invention.

Step 1600) In the JobClient 1302 of the client node 1300, an execution job is created in the same manner as step 200 in FIG. 25.

Step 1610) The Configuration held by the job is referenced to determine whether the parameter detection setting is enabled. If it is enabled, the process proceeds to step 1620; otherwise, it proceeds to step 1611.

Step 1611) A directory on the distributed file system 1330 is searched using the Mapper and Reducer class names, the input data name, and the parameters to determine whether a reusable cache of intermediate data exists. If one exists, the process proceeds to step 1612; otherwise, it proceeds to step 1640.

Step 1612) The Configuration of the job is modified so that no Reducer processing is performed, the Mapper class is replaced with the skip Mapper, and the process proceeds to step 1640.

Step 1620) The configured Mapper and Reducer are saved under different setting names and replaced with the Mapper and Reducer for parameter detection. The contents of the modified Configuration are also reflected in the XML stored on the distributed file system 1330.

Step 1630) If the user has explicitly enabled caching, "true" is always output. Otherwise, the information on the classes and functions that use the parameters, accumulated on the distributed file system 1330, is searched using the Mapper and Reducer class names, the input data name, and the parameter names, and "true" is output if the Mapper does not depend on the parameters, or if it does depend on them but the same parameter value has been used at least the number of times specified by the user. In all other cases, "false" is output and the process proceeds to step 1640.

Step 1631) If the result of step 1630 is "true", a flag for caching the intermediate data is attached to the Configuration of the execution job.

Step 1640) The execution job is stored in the queue in the same manner as step 1510 in FIG. 14.
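The branching in steps 1610 to 1640 can be summarized in a short sketch. Everything named here is hypothetical: the setting names (`detect.params`, `cache.explicit`, `cache.min_uses`, `cache.save`, `reduce.stage.enabled`), the class names `DetectingMapper`, `DetectingReducer`, and `SkipMapper`, and the `usage` dictionary that stands in for the parameter usage information stored on the distributed file system. Representing step 1612 by disabling a separate Reduce stage while keeping the Reducer name is one reading of the text, not a confirmed detail.

```python
def should_cache(explicit, mapper_uses_params, same_value_uses, min_uses):
    """Step 1630: decide whether the intermediate data should be cached."""
    if explicit:                      # the user explicitly enabled caching
        return True
    if not mapper_uses_params:        # the Mapper does not depend on parameters
        return True
    return same_value_uses >= min_uses


def add_job(job, cache_exists, usage):
    conf = job["conf"]
    if conf.get("detect.params"):                          # step 1610
        conf["original.mapper"] = conf["mapper"]           # step 1620
        conf["original.reducer"] = conf["reducer"]
        conf["mapper"] = "DetectingMapper"
        conf["reducer"] = "DetectingReducer"
        if should_cache(conf.get("cache.explicit", False),
                        usage["mapper_uses_params"],
                        usage["same_value_uses"],
                        conf.get("cache.min_uses", 2)):    # step 1630
            conf["cache.save"] = True                      # step 1631
    elif cache_exists:                                     # step 1611
        conf["reduce.stage.enabled"] = False               # step 1612
        conf["mapper"] = "SkipMapper"
    return job                                             # step 1640: enqueue


def new_job(detect):
    return {"conf": {"detect.params": detect, "mapper": "M", "reducer": "R"}}


independent = {"mapper_uses_params": False, "same_value_uses": 0}
rare = {"mapper_uses_params": True, "same_value_uses": 1}

detected = add_job(new_job(True), cache_exists=False, usage=independent)
not_cached = add_job(new_job(True), cache_exists=False, usage=rare)
skipped = add_job(new_job(False), cache_exists=True, usage=independent)
plain = add_job(new_job(False), cache_exists=False, usage=independent)
```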

Next, the processing of the job scheduler of the JobTracker 1311 on the JobTracker node 1310 in FIG. 13 is described.

FIG. 16 is a flowchart of the processing of the job scheduler according to an example of the present invention.

Step 1700) The JobTracker 1311 determines whether there is an execution job in the queue. If there is, the process proceeds to step 1710; if not, it waits on the queue.

Step 1710) An execution job is taken from the queue.

Step 1720) Based on the execution job, the resources of the job are obtained from the appropriate location on the distributed file system 1330, and the required number of Map tasks and Reduce tasks are created.

Step 1730) As the termination check, it is determined whether a termination flag is set. If the flag is set, the processing ends; otherwise, the process returns to step 1700.
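The scheduler loop of steps 1700 to 1730 can be sketched with a standard queue. This is a single-threaded illustration under stated assumptions: the job dictionaries, the `make_tasks` helper, and the `stop_flag` callback are hypothetical, and the short timeout is used so that the termination check in step 1730 is still reached when the queue is empty.

```python
import queue


def make_tasks(job):
    """Step 1720: create the required number of Map and Reduce tasks."""
    maps = [(job["id"], "map", i) for i in range(job["maps"])]
    reduces = [(job["id"], "reduce", i) for i in range(job["reduces"])]
    return maps + reduces


def scheduler_loop(job_queue, stop_flag):
    tasks = []
    while True:
        try:
            # Steps 1700 and 1710: wait for an execution job and take it.
            job = job_queue.get(timeout=0.01)
        except queue.Empty:
            job = None
        if job is not None:
            tasks.extend(make_tasks(job))
        if stop_flag():               # step 1730: termination check
            return tasks


jobs = queue.Queue()
jobs.put({"id": "job1", "maps": 2, "reduces": 1})
jobs.put({"id": "job2", "maps": 1, "reduces": 1})
# For the demonstration, the loop terminates once the queue has been drained.
created = scheduler_loop(jobs, stop_flag=jobs.empty)
```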

Next, the task acquisition processing of the TaskTracker 1321 on the TaskTracker node 1320 is described.

FIG. 17 is a flowchart of the task acquisition loop of the TaskTracker according to an example of the present invention.

Step 1800) The TaskTracker 1321 of the TaskTracker node 1320 sends a heartbeat to the JobTracker 1311 of the JobTracker node 1310 so that the JobTracker can confirm that it is alive, and receives a response from the JobTracker 1311.

Step 1810) The TaskTracker 1321 determines whether the response from the JobTracker 1311 contains a task. If it does, the process proceeds to step 1820; if not, it returns to step 1800.

Step 1820) The TaskTracker 1321 determines whether the task is a Map task. If it is, the process proceeds to step 1821; otherwise, it proceeds to step 1830.

Step 1821) The TaskTracker 1321 executes the given Map task and returns to step 1800. Details are described later with reference to FIG. 18.

Step 1830) It is determined whether the task is a Reduce task. If it is, the process proceeds to step 1831; otherwise, it proceeds to step 1840.

Step 1831) The given Reduce task is executed, and the process returns to step 1800. Details are described later with reference to FIG. 18.

Step 1840) If a termination flag is set, the processing ends; otherwise, the process returns to step 1800.
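The task acquisition loop of steps 1800 to 1840 can be sketched as follows. The response format (a dictionary with optional `task` and `terminate` entries), the task type `cleanup`, and the callbacks are assumptions made for the illustration; a real heartbeat would be a network call to the JobTracker.

```python
def task_tracker_loop(send_heartbeat, run_map, run_reduce):
    while True:
        response = send_heartbeat()            # step 1800
        task = response.get("task")
        if task is None:                       # step 1810: no task in response
            continue
        if task["type"] == "map":              # step 1820
            run_map(task)                      # step 1821
        elif task["type"] == "reduce":         # step 1830
            run_reduce(task)                   # step 1831
        elif response.get("terminate"):        # step 1840
            return


executed = []
responses = iter([
    {"task": {"type": "map", "id": "m1"}},
    {},                                        # heartbeat with no task
    {"task": {"type": "reduce", "id": "r1"}},
    {"task": {"type": "cleanup", "id": "c1"}, "terminate": True},
])

task_tracker_loop(
    send_heartbeat=lambda: next(responses),
    run_map=lambda task: executed.append("map:" + task["id"]),
    run_reduce=lambda task: executed.append("reduce:" + task["id"]),
)
```

As in the flowchart, the termination flag is examined only when the received task is neither a Map task nor a Reduce task.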

Next, the processing for starting the tasks (Map task, Reduce task) in steps 1821 and 1831 above is described.

FIG. 18 is a flowchart of the processing performed at the start of a task according to an example of the present invention.

Step 1900) The TaskTracker 1321 creates, according to the task, a Context class that contains the parameters (Configuration), the job progress (Status), input/output, and so on.

Step 1910) A class (Mapper class or Reducer class) corresponding to the given task is created.

Step 1920) The created Context class is passed to the Mapper class or Reducer class, which is then executed. For a Mapper class, the processing of FIG. 19 is executed; for a Reducer class, the processing of FIG. 20 is executed.

When a Mapper class is created in step 1910 above, the following processing is performed.

FIG. 19 is a flowchart of the processing performed when a Mapper class is created according to an example of the present invention.

Step 2200) When the Context class (a class holding the settings, task ID, data input/output, job status, counters, and so on) and the class used as the Mapper are input, the Configuration held by the Context class is taken out and replaced with a Configuration that detects the class, function, and order in which each parameter is used.

Step 2210) The name of the Mapper class is taken from the Configuration, and the normal Mapper class is created.

Step 2220) Only the arbitrary user-defined pre-processing is performed. Here, the functions in the Mapper and the Reducer are given names that make it explicit whether they belong to the Mapper or to the Reducer (function names that are easy to detect), so that it can easily be determined whether a parameter was used from the Mapper or from the Reducer.

Step 2230) When a part of the assigned input data and the Context class are input, only the arbitrary user-defined Map processing is performed on the next line of the input, using a function name that is easy to detect.

Step 2240) It is determined whether the line read in step 2230 is the last line of the data.

Step 2250) Only the arbitrary user-defined termination processing is performed, using a function name that is easy to detect.

Next, the processing performed when a parameter is requested from the detection context created in step 2200 of FIG. 19 is described.

FIG. 20 is a flowchart of the processing performed when a parameter of the detection Configuration class is requested according to an example of the present invention.

Step 2300) Using the stack trace facility of Java (registered trademark), the functions that called the current function are traced back, and the class name of the calling Mapper and the function name are detected.

Step 2310) From the output of step 2300, a file is created on the distributed file system 1330 in which the calling class name, the order (a number), the function name, and the parameter name form the directory names, the value is the file name, and the frequency is the content.

Step 2320) The value is obtained from the held property class using the parameter name as the key and is converted to the type appropriate for the called method.
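The directory layout described in step 2310 can be illustrated on a local temporary directory, which here stands in for the distributed file system 1330. The function name `record_usage` and the example names `KMeansMapper`, `map`, and `k` are hypothetical, and the sketch assumes that parameter values are safe to use as file names.

```python
import tempfile
from pathlib import Path


def record_usage(root, class_name, order, function, param, value):
    """Write <class>/<order>/<function>/<param>/<value>, holding the frequency."""
    directory = Path(root) / class_name / str(order) / function / param
    directory.mkdir(parents=True, exist_ok=True)
    record = directory / str(value)          # the value is the file name
    count = int(record.read_text()) if record.exists() else 0
    record.write_text(str(count + 1))        # the frequency is the content
    return record


with tempfile.TemporaryDirectory() as root:
    for _ in range(3):
        path = record_usage(root, "KMeansMapper", 0, "map", "k", 3)
    frequency = int(path.read_text())
    relative = path.relative_to(root).as_posix()
```

With this layout, checking how often a given value of a given parameter has been used by a given function reduces to reading one small file.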

Next, the processing performed when the detection Reducer is executed in step 1920 of FIG. 18 is described.

FIG. 21 is a flowchart of the processing of the detection Reducer according to an example of the present invention.

Step 2500) When the Context class (a class holding the settings, task ID, data input/output, job status, counters, and so on) and the name of the class used as the Reducer are input, the Configuration held by the Context class is taken out and replaced with a Configuration that detects the class, function, and order in which each parameter is used.

Step 2510) The name of the Reducer class is obtained from the Configuration, and the normal Reducer class is created.

Step 2520) Only the arbitrary user-defined pre-processing is performed, using a function name that is easy to detect.

Step 2530) It is determined whether the Configuration in the Context class contains a cache save flag. If it does, the process proceeds to step 2540; if not, it proceeds to step 2550.

Step 2540) At a predetermined location in the distributed file system 1330, a directory is created from the Mapper class name, Reducer class name, input data name, and Key, and the entire contents of the Value iterator are output into that directory.

Step 2550) Reduce processing is performed on the assigned intermediate data (Key, Value) using a function that is easy to detect.

Step 2560) Only the arbitrary user-defined termination processing is performed.

Next, the processing performed when the skip Mapper is executed in step 1920 of FIG. 18 is described.

FIG. 22 is a flowchart of the processing of the skip Mapper according to an example of the present invention.

Step 2600) The name of the Reducer class is obtained from the Configuration, and the normal Reducer class is created.

Step 2610) Only the arbitrary user-defined pre-processing is performed.

Step 2620) The cache of intermediate data previously stored on the distributed file system 1330 is read and obtained in the form of a Key and a Value iterator.

Step 2630) Reduce processing is performed on the assigned intermediate data (Key, Value).

Step 2640) Only the arbitrary user-defined termination processing is performed.
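Writing the cache in step 2540 and reading it back in steps 2620 and 2630 can be sketched as a round trip. A local temporary directory stands in for the distributed file system 1330. The function names, the file name `values`, the one-value-per-line text format, and the example class and input names are all assumptions; the source only states that the Mapper class name, Reducer class name, input data name, and Key form the directory.

```python
import tempfile
from pathlib import Path


def save_cache(root, mapper, reducer, input_name, key, values):
    """Step 2540: write all values for one key under a per-key directory."""
    directory = Path(root) / mapper / reducer / input_name / str(key)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "values").write_text("\n".join(str(v) for v in values))


def load_cache(root, mapper, reducer, input_name):
    """Step 2620: yield (key, value iterator) pairs from the stored cache."""
    base = Path(root) / mapper / reducer / input_name
    for key_dir in sorted(base.iterdir()):
        lines = (key_dir / "values").read_text().splitlines()
        yield key_dir.name, iter(lines)


with tempfile.TemporaryDirectory() as root:
    location = (root, "WordMapper", "SumReducer", "docs")
    save_cache(*location, key="a", values=[1, 1])
    save_cache(*location, key="b", values=[1])
    # Step 2630: Reduce directly on the cached (Key, Value iterator) pairs.
    reduced = {key: sum(int(v) for v in values)
               for key, values in load_cache(*location)}
```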

The operations of the components shown in FIG. 1 above can be implemented as a program, which can be installed and executed on a computer used as the data analysis and machine learning apparatus, or distributed via a network.

The present invention is not limited to the embodiments and examples above, and various modifications and applications are possible within the scope of the claims.

10 Distributed file system (HDFS)
20 Aggregation data creation unit (Mapper)
30 Intermediate data storage unit
40 Aggregation unit (Reducer)
50 Control unit (Controller)
60 Combining unit (Combiner)
70 Shuffle unit (Shuffle)
1300 Client node
1301 MapReduce application
1302 Job client (JobClient)
1310 Job tracker (JobTracker) node
1311 Job tracker (JobTracker)
1320 Task tracker (TaskTracker) node
1321 Task tracker (TaskTracker)
1322 Map or Reduce task
1330 Distributed file system (HDFS)

Claims

A data analysis and machine learning apparatus that performs parallel distributed processing of large-scale data, comprising map means for generating one or more key/value pairs from one key/value pair, reduce means for generating one or more key/value pairs from a plurality of key/value pairs having the same key, and distributed file storage means for storing given teacher data, wherein
the map means and the reduce means have
detection means for detecting, during processing of input data, a parameter value, a use position, and a using function from the data as information on the parameters used by the map means and the reduce means, and storing the information in the distributed file storage means,
the apparatus has
intermediate data storage means for referring to the parameters given to a job, collating them with the information on the parameters stored in the distributed file storage means, and, when a predetermined condition is satisfied, storing intermediate data as a cache in the distributed file storage means or local storage means, and
skip processing means for referring to the parameters given to the job and, when the intermediate data is stored in the distributed file storage means or the local storage means, causing only the reduce means to be executed, and
the reduce means includes
means for reading the intermediate data from the distributed file storage means or the local storage means and processing it when the intermediate data is stored in the distributed file storage means or the local storage means.

The data analysis and machine learning apparatus according to claim 1, wherein the detection means includes
means for detecting the class names of the map means and the reduce means, the order in which the parameter names and values given in advance are used during the processing of the map means and the reduce means, and the functions in which they are used.

The data analysis and machine learning apparatus according to claim 1, wherein the intermediate data storage means includes
means for storing the intermediate data as a cache when the processing of the map means does not depend on the parameters, when the usage frequency of the parameters exceeds a predetermined threshold, or when the user designates saving.

A data analysis and machine learning method in an apparatus that performs parallel distributed processing of large-scale data and has map means for generating one or more key/value pairs from one key/value pair, reduce means for generating one or more key/value pairs from a plurality of key/value pairs having the same key, and distributed file storage means for storing given teacher data, the method comprising:
a detection step in which the map means and the reduce means detect, during processing of input data, a parameter value, a use position, and a using function from the data as information on the parameters used by the map means and the reduce means, and store the information in the distributed file storage means;
an intermediate data storage step in which intermediate data storage means refers to the parameters given to a job, collates them with the information on the parameters stored in the distributed file storage means, and, when a predetermined condition is satisfied, stores intermediate data as a cache in the distributed file storage means or local storage means;
a skip processing step in which skip processing means refers to the parameters given to the job and, when the intermediate data is stored in the distributed file storage means or the local storage means, causes only the reduce means to be executed; and
a step in which the reduce means, when the intermediate data is stored in the distributed file storage means or the local storage means, reads the intermediate data from the distributed file storage means or the local storage means and processes it.

The data analysis and machine learning method according to claim 4, wherein, in the detection step,
the class names of the map means and the reduce means, the order in which the parameter names and values given in advance are used during the processing of the map means and the reduce means, and the functions in which they are used are detected.

The data analysis and machine learning method according to claim 4, wherein, in the intermediate data storage step,
the intermediate data is stored as a cache when the processing of the map means does not depend on the parameters, when the usage frequency of the parameters exceeds a predetermined threshold, or when the user designates saving.

A data analysis and machine learning program for causing a computer to function as each of the means constituting the data analysis and machine learning apparatus according to any one of claims 1 to 3.