CN104503844A - MapReduce operation fine granularity sorting method based on multi-stage characteristics - Google Patents
Abstract
The invention provides a fine-grained classification method for MapReduce jobs based on multi-stage features. By classifying the features of each stage independently, jobs that behave similarly in a given stage can be identified and optimized similarly within that stage. This enables rapid stage-by-stage optimization of MapReduce, makes the optimization target clearer, and improves optimization efficiency. The classification results also facilitate fine-grained bottleneck analysis of MapReduce workflows: by finding the bottleneck that limits a program's runtime performance, the program design can be improved in a targeted manner and the performance of the program increased.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a fine-grained classification method for MapReduce jobs based on multi-stage features.
Background technology
MapReduce is a programming model for distributed data processing. A MapReduce job processes data in two phases: the Map phase and the Reduce phase. The Map phase executes first, followed by the Reduce phase. The Map phase can be further divided into five sub-stages: read, map, collect, spill, and merge; the Reduce phase can be further divided into four sub-stages: shuffle, sort, reduce, and write.
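The five Map sub-stages and four Reduce sub-stages described above are what give the method its nine analysis stages. A minimal sketch in Python (the stage names are taken from the text; the data structure itself is only illustrative):

```python
# Sub-stage breakdown of a MapReduce job, as described above: the Map
# phase has five sub-stages and the Reduce phase has four.
SUB_STAGES = {
    "map": ["read", "map", "collect", "spill", "merge"],
    "reduce": ["shuffle", "sort", "reduce", "write"],
}

# Flattened into the ordered list of 9 stages that the classification
# method later treats independently.
ALL_STAGES = SUB_STAGES["map"] + SUB_STAGES["reduce"]
print(len(ALL_STAGES))  # 9
```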
Performance data of a running MapReduce job can be obtained through the distributed monitoring system Ganglia. Ganglia is an open-source cluster monitoring project initiated at UC Berkeley and designed to scale to thousands of nodes. Its core comprises gmond, gmetad, and a web front end. It is mainly used to monitor system performance metrics such as CPU, memory, and disk utilization, I/O load, and network traffic; the operating state of each node is easy to see from the plotted curves, which plays an important role in reasonably adjusting and allocating system resources and improving overall system performance. Every machine runs a daemon named gmond that collects and sends metric data; a host that receives all metric data can display it and pass a condensed form of it up a hierarchy. It is this hierarchical structure that allows Ganglia to scale well. The system load imposed by gmond is very small, which lets it run on every machine in the cluster without affecting user performance. Collecting all of these data repeatedly, however, can affect node performance: network "jitter" occurs when a large number of small messages arrive simultaneously, and this problem can be avoided by keeping the node clocks consistent. Gmetad can be deployed on any node in the cluster, or on a dedicated host connected to the cluster over the network; it communicates with gmond by unicast routing, collects the status information of the nodes in its zone, and saves it in a database in XML form.

For the problem of classifying MapReduce jobs, current approaches mainly analyze the overall performance data of a MapReduce run and do not consider the performance characteristics of each stage in a fine-grained way. In practice, however, the performance of a running job differs from stage to stage, so different job types have performance bottlenecks in different stages, and targeted analysis and tuning require different methods depending on the stage in which the bottleneck lies. It is therefore necessary to research and formulate a fine-grained MapReduce job classification method to classify jobs precisely.
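Since gmetad exposes the collected cluster state as an XML document, per-node metrics can be extracted with a standard XML parser. The sketch below parses a captured dump; the sample document follows Ganglia's usual GANGLIA_XML format, but the metric names (`cpu_idle`, `mem_free`) vary by configuration and the values here are made up:

```python
import xml.etree.ElementTree as ET

def parse_ganglia_metrics(xml_text, wanted=("cpu_idle", "mem_free")):
    """Parse a Ganglia GANGLIA_XML dump into {host: {metric: value}}.

    gmetad serves this XML over TCP; here we parse a captured dump.
    Which metric names are present depends on the gmond configuration.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for host in root.iter("HOST"):
        metrics = {}
        for metric in host.iter("METRIC"):
            name = metric.get("NAME")
            if name in wanted:
                metrics[name] = float(metric.get("VAL"))
        result[host.get("NAME")] = metrics
    return result

# Illustrative dump in Ganglia's XML format (values are made up):
sample = """<GANGLIA_XML VERSION="3.1.7" SOURCE="gmetad">
 <GRID NAME="unspecified">
  <CLUSTER NAME="hadoop" OWNER="unspecified">
   <HOST NAME="node1" IP="10.0.0.1">
    <METRIC NAME="cpu_idle" VAL="72.5" TYPE="float" UNITS="%"/>
    <METRIC NAME="mem_free" VAL="1048576" TYPE="float" UNITS="KB"/>
   </HOST>
  </CLUSTER>
 </GRID>
</GANGLIA_XML>"""

print(parse_ganglia_metrics(sample))
```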
Summary of the invention
In view of this, it is necessary to provide a method that can classify MapReduce jobs at a fine granularity.
To achieve the above object, the present invention adopts the following technical solution:
A fine-grained classification method for MapReduce jobs based on multi-stage features comprises the following steps:
Step S110: collect the performance data of each node in the Hadoop cluster, the performance data comprising CPU utilization, memory utilization, and I/O utilization, where:
the set of CPU utilizations is C_jm = {C1_jm, C2_jm, ..., Cn_jm};
the set of memory utilizations is M_jm = {M1_jm, M2_jm, ..., Mn_jm};
the set of I/O utilizations is I_jm = {I1_jm, I2_jm, ..., In_jm};
and where m is the number of jobs to be classified in the Hadoop cluster, the set of the m jobs is denoted Job = {J1, J2, ..., Jm}, and n is the number of nodes in the Hadoop cluster;
Step S120: compute the set of means of each job's performance data, denoted respectively:
CMean_jm = (C1_jm + C2_jm + ... + Cn_jm) / n;
MMean_jm = (M1_jm + M2_jm + ... + Mn_jm) / n;
IMean_jm = (I1_jm + I2_jm + ... + In_jm) / n;
Step S130: divide the mean performance data of each job into 9 stages, denoted respectively:
CMean_jm = {CM1_jm, CM2_jm, ..., CM9_jm};
MMean_jm = {MM1_jm, MM2_jm, ..., MM9_jm};
IMean_jm = {IM1_jm, IM2_jm, ..., IM9_jm};
Step S140: apply hierarchical clustering to each stage separately.
Preferably, in step S140, applying hierarchical clustering separately to each stage of the MapReduce process comprises the following steps:
Step S141: describe each of the m job classes by its feature vector Fi_jm, where Fi_jm = {CMi_jm, MMi_jm, IMi_jm};
Step S142: find the two closest classes and merge them into one class;
Step S143: recompute the Euclidean distances between the new class and the old classes;
Step S144: repeat steps S142 and S143 until all classes have been merged into a single class.
Preferably, in step S142, finding the two closest classes and merging them into one class specifically comprises: for the m feature vectors of the i-th stage, computing the Euclidean distance between every two of the m feature vectors to obtain the distances of m × (m - 1) combinations, finding the combination with the minimum distance, and merging the two classes in that combination into a new class.
Preferably, the Euclidean distance is computed as:
d(Fi_jo, Fi_jp) = sqrt((CMi_jo - CMi_jp)^2 + (MMi_jo - MMi_jp)^2 + (IMi_jo - IMi_jp)^2),
where the feature vectors of two jobs Jo and Jp in the i-th stage are:
Fi_jo = {CMi_jo, MMi_jo, IMi_jo}, Fi_jp = {CMi_jp, MMi_jp, IMi_jp}.
In the multi-stage-feature-based fine-grained classification method for MapReduce jobs provided by the invention, the features of each stage are classified independently, so jobs that behave similarly in each stage can be identified and optimized similarly within that stage. This enables rapid stage-by-stage optimization of MapReduce, makes the optimization target clearer, and improves optimization efficiency. The classification results also facilitate fine-grained bottleneck analysis of MapReduce workflows: by finding the bottleneck that limits a program's runtime performance, the program design can be improved in a targeted manner and the performance of the program itself increased.
Accompanying drawing explanation
Fig. 1 is a flow chart of the steps of the fine-grained classification method for MapReduce jobs based on multi-stage features provided by the invention.
Fig. 2 is a flow chart of the steps of applying hierarchical clustering to each stage separately.
Embodiment
In order to make the object, technical solution, and beneficial effects of the present invention clearer, the present invention is further elaborated below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.
Referring to Fig. 1, the fine-grained classification method for MapReduce jobs based on multi-stage features provided by the invention comprises the following steps:
Step S110: collect the performance data of each node in the Hadoop cluster, the performance data comprising CPU utilization, memory utilization, and I/O utilization, where:
the set of CPU utilizations is C_jm = {C1_jm, C2_jm, ..., Cn_jm};
the set of memory utilizations is M_jm = {M1_jm, M2_jm, ..., Mn_jm};
the set of I/O utilizations is I_jm = {I1_jm, I2_jm, ..., In_jm};
and where m is the number of jobs to be classified in the Hadoop cluster, the set of the m jobs is denoted Job = {J1, J2, ..., Jm}, and n is the number of nodes in the Hadoop cluster;
In this embodiment, a set Job = {J1, J2, ..., Jm} of m jobs to be classified is run on a Hadoop cluster of n nodes on which Ganglia is installed, and the performance data of each job as it runs on the n nodes is collected from Ganglia's database module.
Step S120: compute the set of means of each job's performance data, denoted respectively:
CMean_jm = (C1_jm + C2_jm + ... + Cn_jm) / n;
MMean_jm = (M1_jm + M2_jm + ... + Mn_jm) / n;
IMean_jm = (I1_jm + I2_jm + ... + In_jm) / n;
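Step S120 is a plain arithmetic mean over the n nodes. A minimal sketch, assuming a hypothetical 4-node cluster with made-up utilization samples for one job:

```python
def stage_means(samples, n):
    """Mean utilization of one job across the n cluster nodes.

    `samples` is one per-node utilization set from step S110,
    e.g. C_jm = {C1_jm, ..., Cn_jm}; the result is CMean_jm from
    step S120 (and likewise for memory and I/O).
    """
    assert len(samples) == n
    return sum(samples) / n

# Hypothetical 4-node cluster: CPU, memory, and I/O utilization samples.
n = 4
c = [40.0, 60.0, 50.0, 50.0]   # CPU usage per node (%)
m = [30.0, 30.0, 50.0, 50.0]   # memory usage per node (%)
i = [10.0, 20.0, 20.0, 30.0]   # I/O usage per node (%)

cmean, mmean, imean = stage_means(c, n), stage_means(m, n), stage_means(i, n)
print(cmean, mmean, imean)  # 50.0 40.0 20.0
```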
Step S130: divide the mean performance data of each job into 9 stages, denoted respectively:
CMean_jm = {CM1_jm, CM2_jm, ..., CM9_jm};
MMean_jm = {MM1_jm, MM2_jm, ..., MM9_jm};
IMean_jm = {IM1_jm, IM2_jm, ..., IM9_jm};
It will be appreciated that the start and end times of each stage of the MapReduce process can be determined from the Hadoop logs; according to these start and end times, each of the three sets of mean performance data of each job can be divided into 9 segments.
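Assuming the stage boundary timestamps have already been recovered from the Hadoop logs, the per-stage division of step S130 can be sketched as follows (the sampling interval and boundary values here are illustrative assumptions, not taken from any real log):

```python
def per_stage_means(timestamps, values, boundaries):
    """Split a sampled metric into per-stage means.

    `boundaries` is a list of 10 timestamps delimiting the 9 stages,
    as recovered from the Hadoop job logs (start/end of each sub-stage).
    Returns [mean over stage 1, ..., mean over stage 9].
    """
    means = []
    for start, end in zip(boundaries, boundaries[1:]):
        window = [v for t, v in zip(timestamps, values) if start <= t < end]
        means.append(sum(window) / len(window) if window else 0.0)
    return means

# Hypothetical: 18 one-second samples, 9 stages of 2 seconds each.
ts = list(range(18))
vals = [float(x) for x in range(18)]
bounds = list(range(0, 20, 2))  # [0, 2, ..., 18] delimits 9 stages
print(per_stage_means(ts, vals, bounds))
# [0.5, 2.5, 4.5, 6.5, 8.5, 10.5, 12.5, 14.5, 16.5]
```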
Step S140: apply hierarchical clustering to each stage of the MapReduce process separately.
It will be appreciated that, because hierarchical clustering is applied to each stage separately, 9 clustering runs are required. In this way, the jobs that behave similarly in each stage can be obtained, these similar jobs can be optimized similarly within that stage, and rapid stage-by-stage optimization of MapReduce can then be implemented.
Referring to Fig. 2, applying hierarchical clustering to each stage separately comprises the following steps:
Step S141: describe each of the m job classes by its feature vector Fi_jm, where Fi_jm = {CMi_jm, MMi_jm, IMi_jm};
Step S142: find the two closest classes and merge them into one class;
Specifically, for the m feature vectors of the i-th stage, compute the Euclidean distance between every two of the m feature vectors to obtain the distances of m × (m - 1) combinations, find the combination with the minimum distance, and merge the two classes in that combination into a new class.
Step S143: recompute the Euclidean distances between the new class and the old classes;
Step S144: repeat steps S142 and S143 until all classes have been merged into a single class.
The Euclidean distance is computed as:
d(Fi_jo, Fi_jp) = sqrt((CMi_jo - CMi_jp)^2 + (MMi_jo - MMi_jp)^2 + (IMi_jo - IMi_jp)^2),
where the feature vectors of two jobs Jo and Jp in the i-th stage are:
Fi_jo = {CMi_jo, MMi_jo, IMi_jo}, Fi_jp = {CMi_jp, MMi_jp, IMi_jp}.
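The clustering loop of steps S141 to S144 for a single stage can be sketched as follows. The description does not specify how the distance from a merged class to the remaining classes is measured in step S143; this sketch uses the class centroid, which is one common choice, and the four stage-i feature vectors are made up:

```python
from math import sqrt

def euclidean(f1, f2):
    """Euclidean distance between two stage feature vectors
    Fi = (CMi, MMi, IMi)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))

def hierarchical_cluster(vectors):
    """Steps S141-S144 for one stage: repeatedly merge the two closest
    classes until a single class remains; return the merge history.

    Assumption: a merged class is represented by its centroid for the
    distance recomputation of step S143.
    """
    # Each class: (set of job indices, centroid feature vector).
    classes = [({j}, list(v)) for j, v in enumerate(vectors)]
    history = []
    while len(classes) > 1:
        # S142: find the closest pair of classes.
        a, b = min(
            ((x, y) for x in range(len(classes)) for y in range(x + 1, len(classes))),
            key=lambda ab: euclidean(classes[ab[0]][1], classes[ab[1]][1]),
        )
        jobs = classes[a][0] | classes[b][0]
        # S143: centroid of the new class, weighted by class sizes.
        merged = [
            (ca * len(classes[a][0]) + cb * len(classes[b][0])) / len(jobs)
            for ca, cb in zip(classes[a][1], classes[b][1])
        ]
        history.append(sorted(jobs))
        classes = [c for k, c in enumerate(classes) if k not in (a, b)]
        classes.append((jobs, merged))
    return history

# Hypothetical stage-i feature vectors (CMi, MMi, IMi) for 4 jobs.
F = [(10.0, 20.0, 5.0), (11.0, 21.0, 5.0), (80.0, 30.0, 40.0), (82.0, 31.0, 41.0)]
print(hierarchical_cluster(F))  # [[0, 1], [2, 3], [0, 1, 2, 3]]
```

Here jobs 0 and 1 merge first, then jobs 2 and 3, and finally the two remaining classes, matching the closest-pair-first behavior of steps S142 to S144.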
In the multi-stage-feature-based fine-grained classification method for MapReduce jobs provided by the invention, the features of each stage are classified independently, so jobs that behave similarly in each stage can be identified and optimized similarly within that stage. This enables rapid stage-by-stage optimization of MapReduce, makes the optimization target clearer, and improves optimization efficiency. The classification results also facilitate fine-grained bottleneck analysis of MapReduce workflows: by finding the bottleneck that limits a program's runtime performance, the program design can be improved in a targeted manner and the performance of the program itself increased.
It should be noted that the description of each of the above embodiments has its own emphasis; for parts not described in detail in a given embodiment, reference may be made to the detailed descriptions elsewhere in the specification, which are not repeated here.
The above are only preferred embodiments of the present invention. It should be pointed out that those skilled in the art can make further improvements and modifications without departing from the principles of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (4)
1. A fine-grained classification method for MapReduce jobs based on multi-stage features, characterized in that it comprises the following steps:
Step S110: collect the performance data of each node in the Hadoop cluster, the performance data comprising CPU utilization, memory utilization, and I/O utilization, where:
the set of CPU utilizations is C_jm = {C1_jm, C2_jm, ..., Cn_jm};
the set of memory utilizations is M_jm = {M1_jm, M2_jm, ..., Mn_jm};
the set of I/O utilizations is I_jm = {I1_jm, I2_jm, ..., In_jm};
and where m is the number of jobs to be classified in the Hadoop cluster, the set of the m jobs is denoted Job = {J1, J2, ..., Jm}, and n is the number of nodes in the Hadoop cluster;
Step S120: compute the set of means of each job's performance data, denoted respectively:
CMean_jm = (C1_jm + C2_jm + ... + Cn_jm) / n;
MMean_jm = (M1_jm + M2_jm + ... + Mn_jm) / n;
IMean_jm = (I1_jm + I2_jm + ... + In_jm) / n;
Step S130: divide the mean performance data of each job into 9 stages, denoted respectively:
CMean_jm = {CM1_jm, CM2_jm, ..., CM9_jm};
MMean_jm = {MM1_jm, MM2_jm, ..., MM9_jm};
IMean_jm = {IM1_jm, IM2_jm, ..., IM9_jm};
Step S140: apply hierarchical clustering to each stage of the MapReduce process separately.
2. The fine-grained classification method for MapReduce jobs based on multi-stage features as claimed in claim 1, characterized in that, in step S140, applying hierarchical clustering to each stage separately comprises the following steps:
Step S141: describe each of the m job classes by its feature vector Fi_jm, where Fi_jm = {CMi_jm, MMi_jm, IMi_jm};
Step S142: find the two closest classes and merge them into one class;
Step S143: recompute the Euclidean distances between the new class and the old classes;
Step S144: repeat steps S142 and S143 until all classes have been merged into a single class.
3. The fine-grained classification method for MapReduce jobs based on multi-stage features as claimed in claim 2, characterized in that, in step S142, finding the two closest classes and merging them into one class specifically comprises: for the m feature vectors of the i-th stage, computing the Euclidean distance between every two of the m feature vectors to obtain the distances of m × (m - 1) combinations, finding the combination with the minimum distance, and merging the two classes in that combination into a new class.
4. The fine-grained classification method for MapReduce jobs based on multi-stage features as claimed in claim 2 or claim 3, characterized in that the Euclidean distance is computed as:
d(Fi_jo, Fi_jp) = sqrt((CMi_jo - CMi_jp)^2 + (MMi_jo - MMi_jp)^2 + (IMi_jo - IMi_jp)^2),
where the feature vectors of two jobs Jo and Jp in the i-th stage are:
Fi_jo = {CMi_jo, MMi_jo, IMi_jo}, Fi_jp = {CMi_jp, MMi_jp, IMi_jp}.
Priority Application (1)
CN201410836410.XA, filed 2014-12-29: A MapReduce job fine-grained classification method based on multi-stage features (granted as CN104503844B).
Publications (2)
CN104503844A, published 2015-04-08; CN104503844B (granted), published 2018-03-09.