CN104503844A - MapReduce operation fine granularity sorting method based on multi-stage characteristics - Google Patents

MapReduce operation fine granularity sorting method based on multi-stage characteristics Download PDF

Info

Publication number
CN104503844A
CN104503844A
Authority
CN
China
Prior art keywords
stage
class
mapreduce
cmi
imi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410836410.XA
Other languages
Chinese (zh)
Other versions
CN104503844B (en
Inventor
贝振东
喻之斌
须成忠
曾经纬
田盼
张慧玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201410836410.XA priority Critical patent/CN104503844B/en
Publication of CN104503844A publication Critical patent/CN104503844A/en
Application granted granted Critical
Publication of CN104503844B publication Critical patent/CN104503844B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a fine-grained classification method for MapReduce jobs based on multi-stage features. By classifying the features of each stage independently, jobs that behave similarly in a given stage can be identified and optimized together for that stage, enabling fast stage-by-stage optimization of MapReduce, making the optimization target clearer and the optimization more efficient. The classification results also facilitate fine-grained bottleneck analysis of a MapReduce workflow: by locating the bottleneck that limits a program's runtime performance, the program design can be improved in a targeted manner, raising the performance of the program itself.

Description

A fine-grained classification method for MapReduce jobs based on multi-stage features
Technical field
The present invention relates to the technical field of data processing, and in particular to a fine-grained classification method for MapReduce jobs based on multi-stage features.
Background technology
MapReduce is a programming model for distributed data processing. Processing data with MapReduce is mainly divided into two phases: the Map phase and the Reduce phase. The Map phase executes first, followed by the Reduce phase. The Map phase can be further divided into five sub-stages: read, map, collect, spill, and merge; the Reduce phase can likewise be further divided into four sub-stages: shuffle, sort, reduce, and write.
Runtime performance data of MapReduce can be obtained with the distributed monitoring system Ganglia. Ganglia is an open-source cluster monitoring project initiated at UC Berkeley and designed to scale to thousands of nodes. Its core comprises gmond, gmetad, and a web front end. It is mainly used to monitor system performance, such as CPU, memory and disk utilization, I/O load, and network traffic; the per-node curves make it easy to see the working state of each node, which plays an important role in reasonably adjusting and allocating system resources and improving overall system performance. Every machine runs a daemon named gmond that collects and sends metric data. A host that receives all metric data can display the data and pass a condensed form of it up a hierarchy; this hierarchical model is what allows Ganglia to scale well. The system load imposed by gmond is very small, so it can run on every machine in the cluster without affecting user performance. Collecting all of this data repeatedly can, however, affect node performance: network "jitter" occurs when a large number of small messages arrive at the same time, and it can be avoided by keeping the node clocks consistent. Gmetad can be deployed on any node in the cluster, or on a dedicated host connected to the cluster over the network; it communicates with gmond via unicast routing, collects the status information of the nodes in its zone, and stores it in a database in the form of XML data.
Regarding the classification of MapReduce jobs, current work mainly analyzes the overall performance data of a MapReduce run and does not examine the performance features of each stage in fine detail. In practice, however, a job performs differently in different stages, so different job types have performance bottlenecks in different stages, and targeted analysis and tuning methods are needed for the bottlenecks of each stage. A fine-grained MapReduce job classification method therefore needs to be studied and formulated to classify jobs precisely.
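As a minimal sketch of how the per-node metrics might be pulled out of Ganglia's XML feed: in a live cluster the XML would be read from gmetad (whose XML port defaults to 8651 in standard Ganglia configurations), and the metric names depend on the gmond configuration; here a small inline sample with hypothetical values stands in for that feed.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the XML that gmetad serves; host names, metric
# names, and values are hypothetical examples.
SAMPLE = """<GANGLIA_XML>
  <CLUSTER NAME="hadoop">
    <HOST NAME="node1">
      <METRIC NAME="cpu_idle" VAL="40.0"/>
      <METRIC NAME="load_one" VAL="1.5"/>
    </HOST>
    <HOST NAME="node2">
      <METRIC NAME="cpu_idle" VAL="70.0"/>
      <METRIC NAME="load_one" VAL="0.5"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""

def metric_by_host(xml_text, metric):
    """Map host name -> value of one metric, for every host reporting it."""
    root = ET.fromstring(xml_text)
    return {
        host.get("NAME"): float(m.get("VAL"))
        for host in root.iter("HOST")
        for m in host.iter("METRIC")
        if m.get("NAME") == metric
    }

# CPU busy percentage derived from the idle metric on each node.
cpu_busy = {h: 100.0 - idle for h, idle in metric_by_host(SAMPLE, "cpu_idle").items()}
```

In a real deployment the same parsing would be applied to the XML stored by gmetad, sampled over the lifetime of each job.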
Summary of the invention
In view of this, it is necessary to provide a method capable of fine-grained classification of MapReduce jobs.
To achieve the above object, the present invention adopts the following technical solution:
A fine-grained classification method for MapReduce jobs based on multi-stage features, comprising the following steps:
Step S110: collect the performance data of each node in a Hadoop cluster, the performance data comprising CPU utilization, memory utilization, and I/O utilization, where
the set of CPU utilizations is C_jm = {C1_jm, C2_jm, ..., Cn_jm};
the set of memory utilizations is M_jm = {M1_jm, M2_jm, ..., Mn_jm};
the set of I/O utilizations is I_jm = {I1_jm, I2_jm, ..., In_jm};
where m is the number of jobs to be classified in the Hadoop cluster, the set of m jobs is denoted Job = {J1, J2, ..., Jm}, and n is the number of nodes in the Hadoop cluster;
Step S120: compute the set of means of each job's performance data, denoted respectively:
CMean_jm = (C1_jm + C2_jm + ... + Cn_jm) / n;
MMean_jm = (M1_jm + M2_jm + ... + Mn_jm) / n;
IMean_jm = (I1_jm + I2_jm + ... + In_jm) / n;
Step S130: divide the means of each job's performance data into 9 stages, denoted respectively:
CMean_jm = {CM1_jm, CM2_jm, ..., CM9_jm};
MMean_jm = {MM1_jm, MM2_jm, ..., MM9_jm};
IMean_jm = {IM1_jm, IM2_jm, ..., IM9_jm};
Step S140: apply hierarchical clustering to each stage separately.
Preferably, in step S140, applying hierarchical clustering to each stage of the MapReduce run separately comprises the following steps:
Step S141: describe each of the m job classes by a feature vector Fi_jm, where Fi_jm = {CMi_jm, MMi_jm, IMi_jm};
Step S142: find the two closest classes and merge them into one class;
Step S143: recompute the Euclidean distances between the new class and the old classes;
Step S144: repeat steps S142 and S143 until all classes have been merged into one.
Preferably, in step S142, finding the two closest classes and merging them into one class is specifically: for the m feature vectors of the i-th stage, compute the Euclidean distance between every two of the m feature vectors, obtaining the distances of m × (m − 1) combinations; find the combination with the smallest distance, and merge the two classes in that combination into a new class.
Preferably, the Euclidean distance is computed as:
DisFi = sqrt((CMi_jo - CMi_jp)^2 + (MMi_jo - MMi_jp)^2 + (IMi_jo - IMi_jp)^2)
where the feature vectors of two jobs in the i-th stage are:
Fi_jo = {CMi_jo, MMi_jo, IMi_jo}, Fi_jp = {CMi_jp, MMi_jp, IMi_jp}.
With the fine-grained classification method for MapReduce jobs based on multi-stage features provided by the invention, classifying the features of each stage independently identifies the jobs that behave similarly in each stage; those similar jobs can be optimized together for that stage, enabling fast stage-level optimization of MapReduce, making the optimization target clearer and the optimization more efficient. The classification results also facilitate fine-grained bottleneck analysis of a MapReduce workflow: finding the bottleneck that limits a program's runtime performance allows the program design to be improved in a targeted manner, raising the performance of the program itself.
Accompanying drawing explanation
Fig. 1 is a flowchart of the steps of the fine-grained classification method for MapReduce jobs based on multi-stage features provided by the invention.
Fig. 2 is a flowchart of the steps of applying hierarchical clustering to each stage separately.
Embodiment
To make the object, technical solution, and beneficial effects of the present invention clearer, the invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein serve only to explain the present invention and are not intended to limit it.
Referring to Fig. 1, which is a flowchart of the steps of the fine-grained classification method for MapReduce jobs based on multi-stage features provided by the invention, the method comprises the following steps:
Step S110: collect the performance data of each node in a Hadoop cluster, the performance data comprising CPU utilization, memory utilization, and I/O utilization, where
the set of CPU utilizations is C_jm = {C1_jm, C2_jm, ..., Cn_jm};
the set of memory utilizations is M_jm = {M1_jm, M2_jm, ..., Mn_jm};
the set of I/O utilizations is I_jm = {I1_jm, I2_jm, ..., In_jm};
where m is the number of jobs to be classified in the Hadoop cluster, the set of m jobs is denoted Job = {J1, J2, ..., Jm}, and n is the number of nodes in the Hadoop cluster.
In the present embodiment, the set Job = {J1, J2, ..., Jm} of m jobs to be classified is run in a Hadoop cluster of n nodes on which Ganglia is installed, and the performance data of each job on the n nodes is collected from Ganglia's database module.
Step S120: compute the set of means of each job's performance data, denoted respectively:
CMean_jm = (C1_jm + C2_jm + ... + Cn_jm) / n;
MMean_jm = (M1_jm + M2_jm + ... + Mn_jm) / n;
IMean_jm = (I1_jm + I2_jm + ... + In_jm) / n;
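Step S120 is a plain arithmetic mean over the n node traces, applied sample by sample. A minimal sketch (the three-node CPU traces below are hypothetical):

```python
def cluster_mean(per_node_traces):
    """Average equal-length utilization traces, one per node, sample by
    sample (step S120): the result is one mean trace for the whole job."""
    n = len(per_node_traces)
    length = len(per_node_traces[0])
    return [sum(trace[t] for trace in per_node_traces) / n for t in range(length)]

# Hypothetical CPU utilization traces (%) of one job on a 3-node cluster,
# three samples per node.
cpu = [
    [10.0, 40.0, 80.0],   # node 1
    [20.0, 60.0, 70.0],   # node 2
    [30.0, 50.0, 90.0],   # node 3
]
cpu_mean = cluster_mean(cpu)  # one averaged trace for the job
```

The same averaging would be applied to the memory and I/O traces to obtain MMean and IMean.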
Step S130: divide the means of each job's performance data into 9 stages, denoted respectively:
CMean_jm = {CM1_jm, CM2_jm, ..., CM9_jm};
MMean_jm = {MM1_jm, MM2_jm, ..., MM9_jm};
IMean_jm = {IM1_jm, IM2_jm, ..., IM9_jm};
It can be understood that the start and end time of each stage of the MapReduce run can be determined from the Hadoop logs; according to these start and end times, each of the three mean-value sets of a job's performance data can be divided into 9 segments.
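Assuming the stage boundaries have already been extracted from the Hadoop job history log (the timestamps and trace values below are hypothetical), the split in step S130 can be sketched as:

```python
def stage_means(timestamps, mean_trace, boundaries):
    """Split one mean utilization trace into per-stage averages (step S130).
    boundaries holds the 10 stage start/end times taken from the Hadoop job
    history log: 5 Map sub-stages (read, map, collect, spill, merge) followed
    by 4 Reduce sub-stages (shuffle, sort, reduce, write)."""
    means = []
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        vals = [v for t, v in zip(timestamps, mean_trace) if lo <= t < hi]
        means.append(sum(vals) / len(vals) if vals else 0.0)
    return means

# Hypothetical: one sample per second for 18 s, 9 stages of 2 s each,
# with the trace constant within each stage.
ts = list(range(18))
trace = [float(10 * (t // 2)) for t in ts]
bounds = list(range(0, 20, 2))        # [0, 2, 4, ..., 18]
cm = stage_means(ts, trace, bounds)   # CM1..CM9 for this job
```

Real stage boundaries are rarely equally spaced; they come from the per-task timestamps in the job history, and each of the CMean, MMean, and IMean traces is segmented with the same boundaries.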
Step S140: apply hierarchical clustering to each stage of the MapReduce run separately.
It can be understood that, since hierarchical clustering is applied to each stage separately, 9 clustering runs are needed. In this way the jobs that behave similarly in each stage are identified, those similar jobs can be optimized together for that stage, and fast stage-level optimization of MapReduce can be implemented.
Referring to Fig. 2, applying hierarchical clustering to each stage separately comprises the following steps:
Step S141: describe each of the m job classes by a feature vector Fi_jm, where Fi_jm = {CMi_jm, MMi_jm, IMi_jm};
Step S142: find the two closest classes and merge them into one class;
Specifically, for the m feature vectors of the i-th stage, compute the Euclidean distance between every two of the m feature vectors, obtaining the distances of m × (m − 1) combinations; find the combination with the smallest distance, and merge the two classes in that combination into a new class.
Step S143: recompute the Euclidean distances between the new class and the old classes;
Step S144: repeat steps S142 and S143 until all classes have been merged into one.
Here the Euclidean distance is computed as:
DisFi = sqrt((CMi_jo - CMi_jp)^2 + (MMi_jo - MMi_jp)^2 + (IMi_jo - IMi_jp)^2)
where the feature vectors of two jobs in the i-th stage are:
Fi_jo = {CMi_jo, MMi_jo, IMi_jo}, Fi_jp = {CMi_jp, MMi_jp, IMi_jp}.
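The merge loop of steps S141–S144 can be sketched as follows. The patent does not state how the feature vector of a merged class is formed, so this sketch assumes centroid linkage (a merged class is represented by the mean of its members); the three-job input is hypothetical:

```python
import math

def euclid(a, b):
    """Euclidean distance between two feature vectors {CMi, MMi, IMi},
    as in the DisFi formula above."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(vectors):
    """Steps S141-S144: start with each job as its own class, repeatedly
    merge the closest pair of classes, and stop when one class remains.
    Assumption: a merged class is represented by its centroid.
    Returns the merge sequence as (class_id, class_id) pairs; merged
    classes get fresh ids counting up from len(vectors)."""
    clusters = {i: (list(v), [v]) for i, v in enumerate(vectors)}  # id -> (centroid, members)
    merges = []
    next_id = len(vectors)
    while len(clusters) > 1:
        # S142: closest pair of current classes by centroid distance.
        a, b = min(
            ((p, q) for p in clusters for q in clusters if p < q),
            key=lambda pq: euclid(clusters[pq[0]][0], clusters[pq[1]][0]),
        )
        _, ma = clusters.pop(a)
        _, mb = clusters.pop(b)
        members = ma + mb
        # S143: the new class's vector, from which distances are recomputed.
        centroid = [sum(v[d] for v in members) / len(members) for d in range(len(members[0]))]
        clusters[next_id] = (centroid, members)
        merges.append((a, b))
        next_id += 1
    return merges

# Hypothetical stage-i feature vectors {CMi, MMi, IMi} of three jobs:
jobs = [(20.0, 30.0, 5.0), (22.0, 31.0, 6.0), (80.0, 10.0, 60.0)]
order = agglomerate(jobs)
```

Running this once per stage, 9 times in total, yields the per-stage groupings of similar jobs described above; cutting the merge sequence at a chosen distance threshold would give the actual classes.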
With the fine-grained classification method for MapReduce jobs based on multi-stage features provided by the invention, classifying the features of each stage independently identifies the jobs that behave similarly in each stage; those similar jobs can be optimized together for that stage, enabling fast stage-level optimization of MapReduce, making the optimization target clearer and the optimization more efficient. The classification results also facilitate fine-grained bottleneck analysis of a MapReduce workflow: finding the bottleneck that limits a program's runtime performance allows the program design to be improved in a targeted manner, raising the performance of the program itself.
It should be noted that the descriptions of the above embodiments each have their own emphasis; for parts not described in detail in one embodiment, reference may be made to the detailed descriptions elsewhere in this specification, which are not repeated here.
The above are only preferred embodiments of the present invention. It should be pointed out that those skilled in the art can make several improvements and modifications without departing from the principles of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (4)

1. A fine-grained classification method for MapReduce jobs based on multi-stage features, characterized by comprising the following steps:
Step S110: collect the performance data of each node in a Hadoop cluster, the performance data comprising CPU utilization, memory utilization, and I/O utilization, where
the set of CPU utilizations is C_jm = {C1_jm, C2_jm, ..., Cn_jm};
the set of memory utilizations is M_jm = {M1_jm, M2_jm, ..., Mn_jm};
the set of I/O utilizations is I_jm = {I1_jm, I2_jm, ..., In_jm};
where m is the number of jobs to be classified in the Hadoop cluster, the set of m jobs is denoted Job = {J1, J2, ..., Jm}, and n is the number of nodes in the Hadoop cluster;
Step S120: compute the set of means of each job's performance data, denoted respectively:
CMean_jm = (C1_jm + C2_jm + ... + Cn_jm) / n;
MMean_jm = (M1_jm + M2_jm + ... + Mn_jm) / n;
IMean_jm = (I1_jm + I2_jm + ... + In_jm) / n;
Step S130: divide the means of each job's performance data into 9 stages, denoted respectively:
CMean_jm = {CM1_jm, CM2_jm, ..., CM9_jm};
MMean_jm = {MM1_jm, MM2_jm, ..., MM9_jm};
IMean_jm = {IM1_jm, IM2_jm, ..., IM9_jm};
Step S140: apply hierarchical clustering to each stage of the MapReduce run separately.
2. The fine-grained classification method for MapReduce jobs based on multi-stage features of claim 1, characterized in that, in step S140, applying hierarchical clustering to each stage separately comprises the following steps:
Step S141: describe each of the m job classes by a feature vector Fi_jm, where Fi_jm = {CMi_jm, MMi_jm, IMi_jm};
Step S142: find the two closest classes and merge them into one class;
Step S143: recompute the Euclidean distances between the new class and the old classes;
Step S144: repeat steps S142 and S143 until all classes have been merged into one.
3. The fine-grained classification method for MapReduce jobs based on multi-stage features of claim 2, characterized in that, in step S142, finding the two closest classes and merging them into one class is specifically: for the m feature vectors of the i-th stage, compute the Euclidean distance between every two of the m feature vectors, obtaining the distances of m × (m − 1) combinations; find the combination with the smallest distance, and merge the two classes in that combination into a new class.
4. The fine-grained classification method for MapReduce jobs based on multi-stage features of claim 2 or 3, characterized in that the Euclidean distance is computed as:
DisFi = sqrt((CMi_jo - CMi_jp)^2 + (MMi_jo - MMi_jp)^2 + (IMi_jo - IMi_jp)^2)
where the feature vectors of two jobs in the i-th stage are:
Fi_jo = {CMi_jo, MMi_jo, IMi_jo}, Fi_jp = {CMi_jp, MMi_jp, IMi_jp}.
CN201410836410.XA 2014-12-29 2014-12-29 A kind of MapReduce operation fine grit classification methods based on multistage feature Active CN104503844B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410836410.XA CN104503844B (en) 2014-12-29 2014-12-29 A kind of MapReduce operation fine grit classification methods based on multistage feature


Publications (2)

Publication Number Publication Date
CN104503844A true CN104503844A (en) 2015-04-08
CN104503844B CN104503844B (en) 2018-03-09

Family

ID=52945244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410836410.XA Active CN104503844B (en) 2014-12-29 2014-12-29 A kind of MapReduce operation fine grit classification methods based on multistage feature

Country Status (1)

Country Link
CN (1) CN104503844B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915260A (en) * 2015-06-19 2015-09-16 北京搜狐新媒体信息技术有限公司 Hadoop cluster management task distributing method and system
CN106126407A (en) * 2016-06-22 2016-11-16 西安交通大学 A kind of performance monitoring Operation Optimization Systerm for distributed memory system and method
CN106341478A (en) * 2016-09-13 2017-01-18 广州中大数字家庭工程技术研究中心有限公司 Education resource sharing system based on Hadoop and realization method
CN110543588A (en) * 2019-08-27 2019-12-06 中国科学院软件研究所 Distributed clustering method and system for large-scale stream data
CN110704515A (en) * 2019-12-11 2020-01-17 四川新网银行股份有限公司 Two-stage online sampling method based on MapReduce model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100082599A1 (en) * 2008-09-30 2010-04-01 Goetz Graefe Characterizing Queries To Predict Execution In A Database
CN102004670A (en) * 2009-12-17 2011-04-06 华中科技大学 Self-adaptive job scheduling method based on MapReduce
US20130254196A1 (en) * 2012-03-26 2013-09-26 Duke University Cost-based optimization of configuration parameters and cluster sizing for hadoop
CN103605576A (en) * 2013-11-25 2014-02-26 华中科技大学 Multithreading-based MapReduce execution system
CN103631657A (en) * 2013-11-19 2014-03-12 浪潮电子信息产业股份有限公司 Task scheduling algorithm based on MapReduce
CN103701635A (en) * 2013-12-10 2014-04-02 中国科学院深圳先进技术研究院 Method and device for configuring Hadoop parameters on line


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JI SUQIN, SHI HONGBO: "K-means clustering ensemble based on MapReduce", Computer Engineering *
LU WEIMING, DU CHENYANG, et al.: "Distributed affinity propagation clustering algorithm based on MapReduce", Journal of Computer Research and Development *


Also Published As

Publication number Publication date
CN104503844B (en) 2018-03-09

Similar Documents

Publication Publication Date Title
Zhang et al. Negative-aware attention framework for image-text matching
Panda et al. Planet: massively parallel learning of tree ensembles with mapreduce
CN104503844B (en) A kind of MapReduce operation fine grit classification methods based on multistage feature
Zhang et al. A distributed frequent itemset mining algorithm using Spark for Big Data analytics
CN101996250B (en) Hadoop-based mass stream data storage and query method and system
CN102831193A (en) Topic detecting device and topic detecting method based on distributed multistage cluster
US20150095381A1 (en) Method and apparatus for managing time series database
US20140207820A1 (en) Method for parallel mining of temporal relations in large event file
CN102799486A (en) Data sampling and partitioning method for MapReduce system
CN105959372A (en) Internet user data analysis method based on mobile application
Zhao et al. Positive and unlabeled learning for graph classification
US20150120637A1 (en) Apparatus and method for analyzing bottlenecks in data distributed data processing system
CN103942308A (en) Method and device for detecting large-scale social network communities
CN108984744A (en) A kind of non-master chain block self-propagation method
Aiello et al. Behavior-driven clustering of queries into topics
CN105389471A (en) Method for reducing training set of machine learning
CN111538766A (en) Text classification method, device, processing equipment and bill classification system
Liu et al. Reinforcement graph clustering with unknown cluster number
Hu et al. Parallel clustering of big data of spatio-temporal trajectory
Zhang et al. Discovering similar Chinese characters in online handwriting with deep convolutional neural networks
Ah-Pine et al. Similarity based hierarchical clustering with an application to text collections
CN103870489A (en) Chinese name self-extension recognition method based on search logs
Singh et al. Survey on outlier detection in data mining
Patra et al. Distance based incremental clustering for mining clusters of arbitrary shapes
CN103150372B (en) The clustering method of magnanimity higher-dimension voice data based on centre indexing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant