CN104503844B - Fine-grained classification method for MapReduce jobs based on multi-stage features - Google Patents
Fine-grained classification method for MapReduce jobs based on multi-stage features
- Publication number
- CN104503844B CN104503844B CN201410836410.XA CN201410836410A CN104503844B CN 104503844 B CN104503844 B CN 104503844B CN 201410836410 A CN201410836410 A CN 201410836410A CN 104503844 B CN104503844 B CN 104503844B
- Authority
- CN
- China
- Prior art keywords
- mapreduce
- stage
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
In the fine-grained classification method for MapReduce jobs based on multi-stage features provided by this invention, the features of each stage are classified separately, so that jobs that are similar in each individual stage can be identified. These similar jobs can then be optimized together for that stage, which enables rapid optimization at the level of individual MapReduce stages, makes the goal of optimization more explicit, and improves the efficiency of optimization. The classification results also support fine-grained bottleneck analysis of MapReduce workflows: by identifying the bottlenecks that constrain runtime performance, the design of a program can be improved in a more targeted way, raising the performance of the program itself.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a fine-grained classification method for MapReduce jobs based on multi-stage features.
Background technology
MapReduce is a programming model for distributed data processing. MapReduce processing is broadly divided into two phases: a Map phase and a Reduce phase. The Map phase is executed first, followed by the Reduce phase. The Map phase can be further divided into five sub-stages (read, map, collect, spill and merge), and the Reduce phase can be further divided into four sub-stages (shuffle, sort, reduce and write).
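The phase decomposition above can be written down as a small sketch; the constant names are illustrative only, not Hadoop API identifiers:

```python
# The two MapReduce phases and their nine sub-stages as described above.
# These names follow the patent's description, not any Hadoop API.
MAP_SUBSTAGES = ["read", "map", "collect", "spill", "merge"]
REDUCE_SUBSTAGES = ["shuffle", "sort", "reduce", "write"]

# The 9 stages over which per-job performance features are later computed.
ALL_STAGES = MAP_SUBSTAGES + REDUCE_SUBSTAGES
```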
Performance data during a MapReduce run can be obtained with the distributed monitoring system Ganglia. Ganglia is an open-source cluster monitoring project initiated at UC Berkeley and designed to scale to thousands of nodes. Its core consists of gmond, gmetad and a web front end. It is mainly used to monitor system performance metrics such as CPU, memory and disk utilization, I/O load and network traffic; the plotted curves make it easy to see the working state of each node, which plays an important role in adjusting and allocating system resources sensibly and improving overall system performance. Every machine runs a daemon named gmond that collects and sends metric data. The host that receives all metric data can display them and can pass a condensed list of them up a hierarchy; it is this hierarchical pattern that allows Ganglia to scale well. The system load imposed by gmond is very small, so it can run on every machine in the cluster without affecting user performance. Collecting all these data repeatedly can, however, affect node performance: network "jitter" occurs when a large number of small messages arrive simultaneously, and it can be avoided by keeping the node clocks synchronized. gmetad can be deployed on any node in the cluster, or on a dedicated host connected to the cluster over the network; it communicates with gmond by unicast, collects the status information of the nodes in its zone, and stores it in a database in XML form.
For the problem of classifying MapReduce jobs, current work mainly analyzes the overall performance data of a MapReduce run, without considering the performance characteristics of each stage at a finer granularity. In practice, however, the performance of the different stages of a run differs, so different job types have performance bottlenecks in different stages, and different analysis and tuning methods need to be applied depending on the stage in which the bottleneck lies. It is therefore necessary to study and design a fine-grained MapReduce job classification method that classifies jobs precisely.
Summary of the invention
In view of the above, it is necessary to provide a fine-grained classification method for MapReduce jobs.
To achieve this object, the present invention adopts the following technical solution:
A fine-grained classification method for MapReduce jobs based on multi-stage features comprises the following steps:
Step S110: collect the performance data of each node in the Hadoop cluster; the performance data include CPU usage, memory usage and I/O usage, where
the set of CPU usage values is Cjm = {C1jm, C2jm, ..., Cnjm};
the set of memory usage values is Mjm = {M1jm, M2jm, ..., Mnjm};
the set of I/O usage values is Ijm = {I1jm, I2jm, ..., Injm};
and m is the number of jobs to be classified in the Hadoop cluster, the set of m jobs is denoted Job = {J1, J2, ..., Jm}, and n is the number of nodes in the Hadoop cluster.
Step S120: compute the mean of each job's performance data over the n nodes, denoted respectively:
CMeanjm=(C1jm+C2jm+...+Cnjm)/n;
MMeanjm=(M1jm+M2jm+...+Mnjm)/n;
IMeanjm=(I1jm+I2jm+...+Injm)/n;
Step S130: divide the mean performance data of each job into 9 stages, denoted respectively:
CMeanjm = {CM1jm, CM2jm, ..., CM9jm};
MMeanjm = {MM1jm, MM2jm, ..., MM9jm};
IMeanjm = {IM1jm, IM2jm, ..., IM9jm};
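Steps S110 through S120 amount to averaging each metric over the n nodes of the cluster. A minimal sketch (the function names are illustrative, not part of the invention):

```python
def job_means(samples):
    """Average a per-node metric over the n cluster nodes (step S120).

    samples: the n per-node utilization values for one job and one metric,
    e.g. [C1, C2, ..., Cn] for CPU usage.
    """
    return sum(samples) / len(samples)


def job_feature_sets(cpu, mem, io):
    """Steps S110-S120: per-job mean CPU, memory and I/O utilization."""
    return {"CMean": job_means(cpu),
            "MMean": job_means(mem),
            "IMean": job_means(io)}
```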
Step S140: perform hierarchical clustering on each stage separately.
Preferably, in step S140, performing hierarchical clustering on each stage of the MapReduce run separately comprises the following steps:
Step S141: describe each of the m job classes by a feature vector Fijm, where Fijm = {CMijm, MMijm, IMijm};
Step S142: find the two closest classes and merge them into one;
Step S143: recompute the Euclidean distances between the new class and the old classes;
Step S144: repeat steps S142 and S143 until everything has been merged into a single class.
Preferably, in step S142, finding the two closest classes and merging them into one specifically comprises: for the m feature vectors of the i-th stage, compute the Euclidean distance between every pair of the m feature vectors, obtaining m × (m - 1) pairwise distances; find the pair with the smallest distance, and merge the two classes of that pair into a new class.
Preferably, the Euclidean distance is computed as:
DisFi = sqrt((CMijo - CMijp)^2 + (MMijo - MMijp)^2 + (IMijo - IMijp)^2)
where the feature vectors of two jobs in the i-th stage are:
Fijo = {CMijo, MMijo, IMijo}, Fijp = {CMijp, MMijp, IMijp}.
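Steps S141 through S144 describe standard agglomerative clustering over the stage-i feature vectors. The sketch below uses the Euclidean distance from the claims; one detail the text leaves open is how the merged class's feature vector is recomputed, so an average-linkage-style centroid is assumed here:

```python
import math


def dist(f, g):
    # Euclidean distance between two stage-i feature vectors
    # F = (CMi, MMi, IMi), per the formula in claim 4.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def agglomerate(vectors):
    """Steps S141-S144: repeatedly merge the closest pair of classes.

    Returns the merge history as (members_a, members_b) tuples. The merged
    class's vector is a size-weighted centroid, an assumption: the patent
    does not state how the new class's feature vector is recomputed.
    """
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):           # step S142: closest pair
            for b in range(a + 1, len(clusters)):
                d = dist(clusters[a][1], clusters[b][1])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        mem_a, vec_a = clusters[a]
        mem_b, vec_b = clusters.pop(b)
        merged = [(x * len(mem_a) + y * len(mem_b)) / (len(mem_a) + len(mem_b))
                  for x, y in zip(vec_a, vec_b)]  # step S143: new class vector
        clusters[a] = (mem_a + mem_b, merged)
        history.append((tuple(mem_a), tuple(mem_b)))
    return history                                # step S144: loop until one class
```

Running this once per stage (9 times in total) yields, for every stage, a dendrogram of jobs that behave similarly in that stage.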
In the fine-grained classification method for MapReduce jobs based on multi-stage features provided by this invention, the features of each stage are classified separately, so that jobs that are similar in each individual stage can be identified. These similar jobs can then be optimized together for that stage, which enables rapid optimization at the level of individual MapReduce stages, makes the goal of optimization more explicit, and improves the efficiency of optimization. The classification results also support fine-grained bottleneck analysis of MapReduce workflows: by identifying the bottlenecks that constrain runtime performance, the design of a program can be improved in a more targeted way, raising the performance of the program itself.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the fine-grained classification method for MapReduce jobs based on multi-stage features provided by the invention.
Fig. 2 is a flow chart of the steps of performing hierarchical clustering on each stage separately.
Detailed description of the embodiments
To make the purpose, technical solution and beneficial effects of the present invention clearer, the invention is described in further detail below with reference to the drawings and an embodiment. It should be understood that the specific embodiment described here only explains the invention and is not intended to limit it.
Referring to Fig. 1, which is a flow chart of the steps of the fine-grained classification method for MapReduce jobs based on multi-stage features provided by the invention, the method comprises the following steps:
Step S110: collect the performance data of each node in the Hadoop cluster; the performance data include CPU usage, memory usage and I/O usage, where
the set of CPU usage values is Cjm = {C1jm, C2jm, ..., Cnjm};
the set of memory usage values is Mjm = {M1jm, M2jm, ..., Mnjm};
the set of I/O usage values is Ijm = {I1jm, I2jm, ..., Injm};
and m is the number of jobs to be classified in the Hadoop cluster, the set of m jobs is denoted Job = {J1, J2, ..., Jm}, and n is the number of nodes in the Hadoop cluster.
In this embodiment, a Hadoop cluster of n nodes with Ganglia installed runs a set Job = {J1, J2, ..., Jm} of m jobs to be classified, and the performance data of each job while running on the n nodes are collected from Ganglia's database module.
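As a hedged illustration of what collecting these data from Ganglia can look like: gmond exposes cluster state as an XML dump with CLUSTER, HOST and METRIC elements carrying NAME and VAL attributes; the concrete metric names (e.g. `cpu_idle`) depend on the Ganglia configuration and are assumptions here:

```python
import xml.etree.ElementTree as ET


def parse_gmond_xml(xml_text, metric_name):
    """Extract one metric's value per host from a gmond XML dump.

    gmond's XML uses CLUSTER > HOST > METRIC elements with NAME/VAL
    attributes; the metric name queried (e.g. "cpu_idle") depends on
    the Ganglia configuration, so treat it as an assumption.
    """
    root = ET.fromstring(xml_text)
    values = {}
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                values[host.get("NAME")] = float(metric.get("VAL"))
    return values
```

The per-node values returned for a metric correspond to one job's set such as Cjm = {C1jm, ..., Cnjm} in step S110.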
Step S120: compute the mean of each job's performance data over the n nodes, denoted respectively:
CMeanjm=(C1jm+C2jm+...+Cnjm)/n;
MMeanjm=(M1jm+M2jm+...+Mnjm)/n;
IMeanjm=(I1jm+I2jm+...+Injm)/n;
Step S130: divide the mean performance data of each job into 9 stages, denoted respectively:
CMeanjm = {CM1jm, CM2jm, ..., CM9jm};
MMeanjm = {MM1jm, MM2jm, ..., MM9jm};
IMeanjm = {IM1jm, IM2jm, ..., IM9jm};
It will be appreciated that the start and stop times of each stage of a MapReduce run can be determined from the Hadoop logs; according to these start and stop times, the mean values of the three performance-data sets of each job can be divided into 9 segments.
Step S140: perform hierarchical clustering on each stage of the MapReduce run separately.
It will be appreciated that, since hierarchical clustering is performed for each stage separately, 9 clustering runs are needed. In this way, jobs that are similar in each individual stage can be identified; these similar jobs can be optimized together for that stage, which enables rapid optimization at the level of individual MapReduce stages.
Referring to Fig. 2, performing hierarchical clustering on each stage separately comprises the following steps:
Step S141: describe each of the m job classes by a feature vector Fijm, where Fijm = {CMijm, MMijm, IMijm}.
Step S142: find the two closest classes and merge them into one.
Specifically, for the m feature vectors of the i-th stage, compute the Euclidean distance between every pair of the m feature vectors, obtaining m × (m - 1) pairwise distances; find the pair with the smallest distance, and merge the two classes of that pair into a new class.
Step S143: recompute the Euclidean distances between the new class and the old classes.
Step S144: repeat steps S142 and S143 until everything has been merged into a single class.
The Euclidean distance is computed as:
DisFi = sqrt((CMijo - CMijp)^2 + (MMijo - MMijp)^2 + (IMijo - IMijp)^2)
where the feature vectors of two jobs in the i-th stage are:
Fijo = {CMijo, MMijo, IMijo}, Fijp = {CMijp, MMijp, IMijp}.
In the fine-grained classification method for MapReduce jobs based on multi-stage features provided by this invention, the features of each stage are classified separately, so that jobs that are similar in each individual stage can be identified. These similar jobs can then be optimized together for that stage, which enables rapid optimization at the level of individual MapReduce stages, makes the goal of optimization more explicit, and improves the efficiency of optimization. The classification results also support fine-grained bottleneck analysis of MapReduce workflows: by identifying the bottlenecks that constrain runtime performance, the design of a program can be improved in a more targeted way, raising the performance of the program itself.
It should be noted that each of the embodiments above emphasizes particular aspects; for parts of an embodiment that are not described in detail, refer to the detailed description elsewhere in the specification, which is not repeated here.
The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and modifications without departing from the principles of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the invention.
Claims (4)
- 1. A fine-grained classification method for MapReduce jobs based on multi-stage features, characterized by comprising the following steps:
Step S110: collect the performance data of each node in the Hadoop cluster, the performance data including CPU usage, memory usage and I/O usage, where
the set of CPU usage values is Cjm = {C1jm, C2jm, ..., Cnjm};
the set of memory usage values is Mjm = {M1jm, M2jm, ..., Mnjm};
the set of I/O usage values is Ijm = {I1jm, I2jm, ..., Injm};
and m is the number of jobs to be classified in the Hadoop cluster, the set of m jobs is denoted Job = {J1, J2, ..., Jm}, and n is the number of nodes in the Hadoop cluster;
Step S120: compute the mean of each job's performance data over the n nodes, denoted respectively:
CMeanjm = (C1jm + C2jm + ... + Cnjm)/n;
MMeanjm = (M1jm + M2jm + ... + Mnjm)/n;
IMeanjm = (I1jm + I2jm + ... + Injm)/n;
Step S130: divide the mean performance data of each job into 9 stages, with CMeanjm denoted {CM1jm, CM2jm, ..., CM9jm}, MMeanjm denoted {MM1jm, MM2jm, ..., MM9jm}, and IMeanjm denoted {IM1jm, IM2jm, ..., IM9jm};
Step S140: perform hierarchical clustering on each stage of the MapReduce run separately.
- 2. The fine-grained classification method for MapReduce jobs based on multi-stage features of claim 1, characterized in that, in step S140, performing hierarchical clustering on each stage comprises the following steps:
Step S141: describe each of the m job classes by a feature vector Fijm, where Fijm = {CMijm, MMijm, IMijm} and i denotes the i-th stage;
Step S142: find the two closest classes and merge them into one;
Step S143: recompute the Euclidean distances between the new class and the old classes;
Step S144: repeat steps S142 and S143 until everything has been merged into a single class.
- 3. The fine-grained classification method for MapReduce jobs based on multi-stage features of claim 2, characterized in that, in step S142, finding the two closest classes and merging them into one specifically comprises: for the m feature vectors of the i-th stage, compute the Euclidean distance between every pair of the m feature vectors, obtaining m × (m - 1) pairwise distances; find the pair with the smallest distance, and merge the two classes of that pair into a new class.
- 4. The fine-grained classification method for MapReduce jobs based on multi-stage features of claim 2 or 3, characterized in that the Euclidean distance is computed as:
DisFi = sqrt((CMijo - CMijp)^2 + (MMijo - MMijp)^2 + (IMijo - IMijp)^2)
where the feature vectors of two jobs in the i-th stage are:
Fijo = {CMijo, MMijo, IMijo}, Fijp = {CMijp, MMijp, IMijp}.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410836410.XA CN104503844B (en) | 2014-12-29 | 2014-12-29 | A kind of MapReduce operation fine grit classification methods based on multistage feature |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104503844A CN104503844A (en) | 2015-04-08 |
CN104503844B true CN104503844B (en) | 2018-03-09 |
Family
ID=52945244
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |