CN104503844A - MapReduce operation fine granularity sorting method based on multi-stage characteristics - Google Patents

MapReduce operation fine granularity sorting method based on multi-stage characteristics Download PDF

Info

Publication number
CN104503844A
CN104503844A
Authority
CN
China
Prior art keywords
stage
class
mapreduce
cmi
imi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410836410.XA
Other languages
Chinese (zh)
Other versions
CN104503844B (en
Inventor
贝振东
喻之斌
须成忠
曾经纬
田盼
张慧玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201410836410.XA priority Critical patent/CN104503844B/en
Publication of CN104503844A publication Critical patent/CN104503844A/en
Application granted granted Critical
Publication of CN104503844B publication Critical patent/CN104503844B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a fine-grained classification method for MapReduce jobs based on multi-stage features. By classifying the features of each stage independently, jobs that behave similarly in a given stage can be identified and optimized together for that stage, enabling fast stage-by-stage optimization of MapReduce, making the optimization target clearer and the optimization more efficient. The classification results also facilitate fine-grained bottleneck analysis of a MapReduce workflow: by locating the bottleneck that limits a program's runtime performance, the program design can be improved in a targeted manner, raising the performance of the program itself.

Description

A fine-grained classification method for MapReduce jobs based on multi-stage features
Technical field
The present invention relates to the technical field of data processing, and in particular to a fine-grained classification method for MapReduce jobs based on multi-stage features.
Background technology
MapReduce is a programming model for distributed data processing. Processing data with MapReduce is mainly divided into two phases: the Map phase and the Reduce phase. The Map phase executes first, followed by the Reduce phase. The Map phase can be further divided into five sub-stages: read, map, collect, spill, and merge; the Reduce phase can likewise be further divided into four sub-stages: shuffle, sort, reduce, and write.
Runtime performance data of MapReduce can be obtained with the distributed monitoring system Ganglia. Ganglia is an open-source cluster monitoring project initiated at UC Berkeley and designed to scale to thousands of nodes. Its core comprises gmond, gmetad, and a web front end. It is mainly used to monitor system performance, such as CPU, memory and disk utilization, I/O load, and network traffic; the per-node curves make it easy to see the working state of each node, which plays an important role in reasonably adjusting and allocating system resources and improving overall system performance. Every machine runs a daemon named gmond that collects and sends metric data. A host that receives all metric data can display the data and pass a condensed form of it up a hierarchy; this hierarchical model is what allows Ganglia to scale well. The system load imposed by gmond is very small, so it can run on every machine in the cluster without affecting user performance. Collecting all of this data repeatedly can, however, affect node performance: network "jitter" occurs when a large number of small messages arrive at the same time, and it can be avoided by keeping the node clocks consistent. Gmetad can be deployed on any node in the cluster, or on a dedicated host connected to the cluster over the network; it communicates with gmond via unicast routing, collects the status information of the nodes in its zone, and stores it in a database in the form of XML data.
Regarding the classification of MapReduce jobs, current work mainly analyzes the overall performance data of a MapReduce run and does not examine the performance features of each stage in fine detail. In practice, however, a job performs differently in different stages, so different job types have performance bottlenecks in different stages, and targeted analysis and tuning methods are needed for the bottlenecks of each stage. A fine-grained MapReduce job classification method therefore needs to be studied and formulated to classify jobs precisely.
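As a minimal sketch of how the per-node metrics might be pulled out of Ganglia's XML feed: in a live cluster the XML would be read from gmetad (whose XML port defaults to 8651 in standard Ganglia configurations), and the metric names depend on the gmond configuration; here a small inline sample with hypothetical values stands in for that feed.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the XML that gmetad serves; host names, metric
# names, and values are hypothetical examples.
SAMPLE = """<GANGLIA_XML>
  <CLUSTER NAME="hadoop">
    <HOST NAME="node1">
      <METRIC NAME="cpu_idle" VAL="40.0"/>
      <METRIC NAME="load_one" VAL="1.5"/>
    </HOST>
    <HOST NAME="node2">
      <METRIC NAME="cpu_idle" VAL="70.0"/>
      <METRIC NAME="load_one" VAL="0.5"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""

def metric_by_host(xml_text, metric):
    """Map host name -> value of one metric, for every host reporting it."""
    root = ET.fromstring(xml_text)
    return {
        host.get("NAME"): float(m.get("VAL"))
        for host in root.iter("HOST")
        for m in host.iter("METRIC")
        if m.get("NAME") == metric
    }

# CPU busy percentage derived from the idle metric on each node.
cpu_busy = {h: 100.0 - idle for h, idle in metric_by_host(SAMPLE, "cpu_idle").items()}
```

In a real deployment the same parsing would be applied to the XML stored by gmetad, sampled over the lifetime of each job.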
Summary of the invention
In view of this, it is necessary to provide a method capable of fine-grained classification of MapReduce jobs.
To achieve the above object, the present invention adopts the following technical solution:
A fine-grained classification method for MapReduce jobs based on multi-stage features, comprising the following steps:
Step S110: collect the performance data of each node in a Hadoop cluster, the performance data comprising CPU utilization, memory utilization, and I/O utilization, where
the set of CPU utilizations is C_jm = {C1_jm, C2_jm, ..., Cn_jm};
the set of memory utilizations is M_jm = {M1_jm, M2_jm, ..., Mn_jm};
the set of I/O utilizations is I_jm = {I1_jm, I2_jm, ..., In_jm};
where m is the number of jobs to be classified in the Hadoop cluster, the set of m jobs is denoted Job = {J1, J2, ..., Jm}, and n is the number of nodes in the Hadoop cluster;
Step S120: compute the set of means of each job's performance data, denoted respectively:
CMean_jm = (C1_jm + C2_jm + ... + Cn_jm) / n;
MMean_jm = (M1_jm + M2_jm + ... + Mn_jm) / n;
IMean_jm = (I1_jm + I2_jm + ... + In_jm) / n;
Step S130: divide the means of each job's performance data into 9 stages, denoted respectively:
CMean_jm = {CM1_jm, CM2_jm, ..., CM9_jm};
MMean_jm = {MM1_jm, MM2_jm, ..., MM9_jm};
IMean_jm = {IM1_jm, IM2_jm, ..., IM9_jm};
Step S140: apply hierarchical clustering to each stage separately.
Preferably, in step S140, applying hierarchical clustering to each stage of the MapReduce run separately comprises the following steps:
Step S141: describe each of the m job classes by a feature vector Fi_jm, where Fi_jm = {CMi_jm, MMi_jm, IMi_jm};
Step S142: find the two closest classes and merge them into one class;
Step S143: recompute the Euclidean distances between the new class and the old classes;
Step S144: repeat steps S142 and S143 until all classes have been merged into one.
Preferably, in step S142, finding the two closest classes and merging them into one class is specifically: for the m feature vectors of the i-th stage, compute the Euclidean distance between every two of the m feature vectors, obtaining the distances of m × (m − 1) combinations; find the combination with the smallest distance, and merge the two classes in that combination into a new class.
Preferably, the Euclidean distance is computed as:
DisFi = sqrt((CMi_jo - CMi_jp)^2 + (MMi_jo - MMi_jp)^2 + (IMi_jo - IMi_jp)^2)
where the feature vectors of two jobs in the i-th stage are:
Fi_jo = {CMi_jo, MMi_jo, IMi_jo}, Fi_jp = {CMi_jp, MMi_jp, IMi_jp}.
With the fine-grained classification method for MapReduce jobs based on multi-stage features provided by the invention, classifying the features of each stage independently identifies the jobs that behave similarly in each stage; those similar jobs can be optimized together for that stage, enabling fast stage-level optimization of MapReduce, making the optimization target clearer and the optimization more efficient. The classification results also facilitate fine-grained bottleneck analysis of a MapReduce workflow: finding the bottleneck that limits a program's runtime performance allows the program design to be improved in a targeted manner, raising the performance of the program itself.
Accompanying drawing explanation
Fig. 1 is a flowchart of the steps of the fine-grained classification method for MapReduce jobs based on multi-stage features provided by the invention.
Fig. 2 is a flowchart of the steps of applying hierarchical clustering to each stage separately.
Embodiment
To make the object, technical solution, and beneficial effects of the present invention clearer, the invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein serve only to explain the present invention and are not intended to limit it.
Referring to Fig. 1, which is a flowchart of the steps of the fine-grained classification method for MapReduce jobs based on multi-stage features provided by the invention, the method comprises the following steps:
Step S110: collect the performance data of each node in a Hadoop cluster, the performance data comprising CPU utilization, memory utilization, and I/O utilization, where
the set of CPU utilizations is C_jm = {C1_jm, C2_jm, ..., Cn_jm};
the set of memory utilizations is M_jm = {M1_jm, M2_jm, ..., Mn_jm};
the set of I/O utilizations is I_jm = {I1_jm, I2_jm, ..., In_jm};
where m is the number of jobs to be classified in the Hadoop cluster, the set of m jobs is denoted Job = {J1, J2, ..., Jm}, and n is the number of nodes in the Hadoop cluster.
In the present embodiment, the set Job = {J1, J2, ..., Jm} of m jobs to be classified is run in a Hadoop cluster of n nodes on which Ganglia is installed, and the performance data of each job on the n nodes is collected from Ganglia's database module.
Step S120: compute the set of means of each job's performance data, denoted respectively:
CMean_jm = (C1_jm + C2_jm + ... + Cn_jm) / n;
MMean_jm = (M1_jm + M2_jm + ... + Mn_jm) / n;
IMean_jm = (I1_jm + I2_jm + ... + In_jm) / n;
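Step S120 is a plain arithmetic mean over the n node traces, applied sample by sample. A minimal sketch (the three-node CPU traces below are hypothetical):

```python
def cluster_mean(per_node_traces):
    """Average equal-length utilization traces, one per node, sample by
    sample (step S120): the result is one mean trace for the whole job."""
    n = len(per_node_traces)
    length = len(per_node_traces[0])
    return [sum(trace[t] for trace in per_node_traces) / n for t in range(length)]

# Hypothetical CPU utilization traces (%) of one job on a 3-node cluster,
# three samples per node.
cpu = [
    [10.0, 40.0, 80.0],   # node 1
    [20.0, 60.0, 70.0],   # node 2
    [30.0, 50.0, 90.0],   # node 3
]
cpu_mean = cluster_mean(cpu)  # one averaged trace for the job
```

The same averaging would be applied to the memory and I/O traces to obtain MMean and IMean.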
Step S130: divide the means of each job's performance data into 9 stages, denoted respectively:
CMean_jm = {CM1_jm, CM2_jm, ..., CM9_jm};
MMean_jm = {MM1_jm, MM2_jm, ..., MM9_jm};
IMean_jm = {IM1_jm, IM2_jm, ..., IM9_jm};
It can be understood that the start and end time of each stage of the MapReduce run can be determined from the Hadoop logs; according to these start and end times, each of the three mean-value sets of a job's performance data can be divided into 9 segments.
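Assuming the stage boundaries have already been extracted from the Hadoop job history log (the timestamps and trace values below are hypothetical), the split in step S130 can be sketched as:

```python
def stage_means(timestamps, mean_trace, boundaries):
    """Split one mean utilization trace into per-stage averages (step S130).
    boundaries holds the 10 stage start/end times taken from the Hadoop job
    history log: 5 Map sub-stages (read, map, collect, spill, merge) followed
    by 4 Reduce sub-stages (shuffle, sort, reduce, write)."""
    means = []
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        vals = [v for t, v in zip(timestamps, mean_trace) if lo <= t < hi]
        means.append(sum(vals) / len(vals) if vals else 0.0)
    return means

# Hypothetical: one sample per second for 18 s, 9 stages of 2 s each,
# with the trace constant within each stage.
ts = list(range(18))
trace = [float(10 * (t // 2)) for t in ts]
bounds = list(range(0, 20, 2))        # [0, 2, 4, ..., 18]
cm = stage_means(ts, trace, bounds)   # CM1..CM9 for this job
```

Real stage boundaries are rarely equally spaced; they come from the per-task timestamps in the job history, and each of the CMean, MMean, and IMean traces is segmented with the same boundaries.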
Step S140: apply hierarchical clustering to each stage of the MapReduce run separately.
It can be understood that, since hierarchical clustering is applied to each stage separately, 9 clustering runs are needed. In this way the jobs that behave similarly in each stage are identified, those similar jobs can be optimized together for that stage, and fast stage-level optimization of MapReduce can be implemented.
Referring to Fig. 2, applying hierarchical clustering to each stage separately comprises the following steps:
Step S141: describe each of the m job classes by a feature vector Fi_jm, where Fi_jm = {CMi_jm, MMi_jm, IMi_jm};
Step S142: find the two closest classes and merge them into one class;
Specifically, for the m feature vectors of the i-th stage, compute the Euclidean distance between every two of the m feature vectors, obtaining the distances of m × (m − 1) combinations; find the combination with the smallest distance, and merge the two classes in that combination into a new class.
Step S143: recompute the Euclidean distances between the new class and the old classes;
Step S144: repeat steps S142 and S143 until all classes have been merged into one.
Here the Euclidean distance is computed as:
DisFi = sqrt((CMi_jo - CMi_jp)^2 + (MMi_jo - MMi_jp)^2 + (IMi_jo - IMi_jp)^2)
where the feature vectors of two jobs in the i-th stage are:
Fi_jo = {CMi_jo, MMi_jo, IMi_jo}, Fi_jp = {CMi_jp, MMi_jp, IMi_jp}.
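The merge loop of steps S141–S144 can be sketched as follows. The patent does not state how the feature vector of a merged class is formed, so this sketch assumes centroid linkage (a merged class is represented by the mean of its members); the three-job input is hypothetical:

```python
import math

def euclid(a, b):
    """Euclidean distance between two feature vectors {CMi, MMi, IMi},
    as in the DisFi formula above."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(vectors):
    """Steps S141-S144: start with each job as its own class, repeatedly
    merge the closest pair of classes, and stop when one class remains.
    Assumption: a merged class is represented by its centroid.
    Returns the merge sequence as (class_id, class_id) pairs; merged
    classes get fresh ids counting up from len(vectors)."""
    clusters = {i: (list(v), [v]) for i, v in enumerate(vectors)}  # id -> (centroid, members)
    merges = []
    next_id = len(vectors)
    while len(clusters) > 1:
        # S142: closest pair of current classes by centroid distance.
        a, b = min(
            ((p, q) for p in clusters for q in clusters if p < q),
            key=lambda pq: euclid(clusters[pq[0]][0], clusters[pq[1]][0]),
        )
        _, ma = clusters.pop(a)
        _, mb = clusters.pop(b)
        members = ma + mb
        # S143: the new class's vector, from which distances are recomputed.
        centroid = [sum(v[d] for v in members) / len(members) for d in range(len(members[0]))]
        clusters[next_id] = (centroid, members)
        merges.append((a, b))
        next_id += 1
    return merges

# Hypothetical stage-i feature vectors {CMi, MMi, IMi} of three jobs:
jobs = [(20.0, 30.0, 5.0), (22.0, 31.0, 6.0), (80.0, 10.0, 60.0)]
order = agglomerate(jobs)
```

Running this once per stage, 9 times in total, yields the per-stage groupings of similar jobs described above; cutting the merge sequence at a chosen distance threshold would give the actual classes.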
With the fine-grained classification method for MapReduce jobs based on multi-stage features provided by the invention, classifying the features of each stage independently identifies the jobs that behave similarly in each stage; those similar jobs can be optimized together for that stage, enabling fast stage-level optimization of MapReduce, making the optimization target clearer and the optimization more efficient. The classification results also facilitate fine-grained bottleneck analysis of a MapReduce workflow: finding the bottleneck that limits a program's runtime performance allows the program design to be improved in a targeted manner, raising the performance of the program itself.
It should be noted that the descriptions of the above embodiments each have their own emphasis; for parts not described in detail in one embodiment, reference may be made to the detailed descriptions elsewhere in this specification, which are not repeated here.
The above are only preferred embodiments of the present invention. It should be pointed out that those skilled in the art can make several improvements and modifications without departing from the principles of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (4)

1. A fine-grained classification method for MapReduce jobs based on multi-stage features, characterized by comprising the following steps:
Step S110: collect the performance data of each node in a Hadoop cluster, the performance data comprising CPU utilization, memory utilization, and I/O utilization, where
the set of CPU utilizations is C_jm = {C1_jm, C2_jm, ..., Cn_jm};
the set of memory utilizations is M_jm = {M1_jm, M2_jm, ..., Mn_jm};
the set of I/O utilizations is I_jm = {I1_jm, I2_jm, ..., In_jm};
where m is the number of jobs to be classified in the Hadoop cluster, the set of m jobs is denoted Job = {J1, J2, ..., Jm}, and n is the number of nodes in the Hadoop cluster;
Step S120: compute the set of means of each job's performance data, denoted respectively:
CMean_jm = (C1_jm + C2_jm + ... + Cn_jm) / n;
MMean_jm = (M1_jm + M2_jm + ... + Mn_jm) / n;
IMean_jm = (I1_jm + I2_jm + ... + In_jm) / n;
Step S130: divide the means of each job's performance data into 9 stages, denoted respectively:
CMean_jm = {CM1_jm, CM2_jm, ..., CM9_jm};
MMean_jm = {MM1_jm, MM2_jm, ..., MM9_jm};
IMean_jm = {IM1_jm, IM2_jm, ..., IM9_jm};
Step S140: apply hierarchical clustering to each stage of the MapReduce run separately.
2. The fine-grained classification method for MapReduce jobs based on multi-stage features of claim 1, characterized in that, in step S140, applying hierarchical clustering to each stage separately comprises the following steps:
Step S141: describe each of the m job classes by a feature vector Fi_jm, where Fi_jm = {CMi_jm, MMi_jm, IMi_jm};
Step S142: find the two closest classes and merge them into one class;
Step S143: recompute the Euclidean distances between the new class and the old classes;
Step S144: repeat steps S142 and S143 until all classes have been merged into one.
3. The fine-grained classification method for MapReduce jobs based on multi-stage features of claim 2, characterized in that, in step S142, finding the two closest classes and merging them into one class is specifically: for the m feature vectors of the i-th stage, compute the Euclidean distance between every two of the m feature vectors, obtaining the distances of m × (m − 1) combinations; find the combination with the smallest distance, and merge the two classes in that combination into a new class.
4. The fine-grained classification method for MapReduce jobs based on multi-stage features of claim 2 or 3, characterized in that the Euclidean distance is computed as:
DisFi = sqrt((CMi_jo - CMi_jp)^2 + (MMi_jo - MMi_jp)^2 + (IMi_jo - IMi_jp)^2)
where the feature vectors of two jobs in the i-th stage are:
Fi_jo = {CMi_jo, MMi_jo, IMi_jo}, Fi_jp = {CMi_jp, MMi_jp, IMi_jp}.
CN201410836410.XA 2014-12-29 2014-12-29 A kind of MapReduce operation fine grit classification methods based on multistage feature Active CN104503844B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410836410.XA CN104503844B (en) 2014-12-29 2014-12-29 A kind of MapReduce operation fine grit classification methods based on multistage feature


Publications (2)

Publication Number Publication Date
CN104503844A true CN104503844A (en) 2015-04-08
CN104503844B CN104503844B (en) 2018-03-09

Family

ID=52945244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410836410.XA Active CN104503844B (en) 2014-12-29 2014-12-29 A kind of MapReduce operation fine grit classification methods based on multistage feature

Country Status (1)

Country Link
CN (1) CN104503844B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915260A (en) * 2015-06-19 2015-09-16 北京搜狐新媒体信息技术有限公司 Hadoop cluster management task distributing method and system
CN106126407A (en) * 2016-06-22 2016-11-16 西安交通大学 A kind of performance monitoring Operation Optimization Systerm for distributed memory system and method
CN106341478A (en) * 2016-09-13 2017-01-18 广州中大数字家庭工程技术研究中心有限公司 Education resource sharing system based on Hadoop and realization method
CN110543588A (en) * 2019-08-27 2019-12-06 中国科学院软件研究所 Distributed clustering method and system for large-scale stream data
CN110704515A (en) * 2019-12-11 2020-01-17 四川新网银行股份有限公司 Two-stage online sampling method based on MapReduce model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100082599A1 (en) * 2008-09-30 2010-04-01 Goetz Graefe Characterizing Queries To Predict Execution In A Database
CN102004670A (en) * 2009-12-17 2011-04-06 华中科技大学 Self-adaptive job scheduling method based on MapReduce
US20130254196A1 (en) * 2012-03-26 2013-09-26 Duke University Cost-based optimization of configuration parameters and cluster sizing for hadoop
CN103605576A (en) * 2013-11-25 2014-02-26 华中科技大学 Multithreading-based MapReduce execution system
CN103631657A (en) * 2013-11-19 2014-03-12 浪潮电子信息产业股份有限公司 Task scheduling algorithm based on MapReduce
CN103701635A (en) * 2013-12-10 2014-04-02 中国科学院深圳先进技术研究院 Method and device for configuring Hadoop parameters on line


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JI SUQIN, SHI HONGBO: "K-means clustering ensemble based on MapReduce", Computer Engineering *
LU WEIMING, DU CHENYANG, et al.: "Distributed affinity propagation clustering algorithm based on MapReduce", Journal of Computer Research and Development *


Also Published As

Publication number Publication date
CN104503844B (en) 2018-03-09

Similar Documents

Publication Publication Date Title
Zhang et al. Negative-aware attention framework for image-text matching
Panda et al. Planet: massively parallel learning of tree ensembles with mapreduce
CN104503844B (en) A kind of MapReduce operation fine grit classification methods based on multistage feature
Zhang et al. A distributed frequent itemset mining algorithm using Spark for Big Data analytics
CN101996250B (en) Hadoop-based mass stream data storage and query method and system
CN102831193A (en) Topic detecting device and topic detecting method based on distributed multistage cluster
US20150095381A1 (en) Method and apparatus for managing time series database
US20140207820A1 (en) Method for parallel mining of temporal relations in large event file
CN102799486A (en) Data sampling and partitioning method for MapReduce system
CN105959372A (en) Internet user data analysis method based on mobile application
Zhao et al. Positive and unlabeled learning for graph classification
US20150120637A1 (en) Apparatus and method for analyzing bottlenecks in data distributed data processing system
CN103942308A (en) Method and device for detecting large-scale social network communities
CN108984744A (en) A kind of non-master chain block self-propagation method
Aiello et al. Behavior-driven clustering of queries into topics
CN105389471A (en) Method for reducing training set of machine learning
CN111538766A (en) Text classification method, device, processing equipment and bill classification system
Liu et al. Reinforcement graph clustering with unknown cluster number
Hu et al. Parallel clustering of big data of spatio-temporal trajectory
Zhang et al. Discovering similar Chinese characters in online handwriting with deep convolutional neural networks
Ah-Pine et al. Similarity based hierarchical clustering with an application to text collections
CN103870489A (en) Chinese name self-extension recognition method based on search logs
Singh et al. Survey on outlier detection in data mining
Patra et al. Distance based incremental clustering for mining clusters of arbitrary shapes
CN103150372B (en) The clustering method of magnanimity higher-dimension voice data based on centre indexing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant