CN104503844B - MapReduce job fine-grained classification method based on multi-stage features - Google Patents

MapReduce job fine-grained classification method based on multi-stage features

Info

Publication number
CN104503844B
CN104503844B (application CN201410836410.XA)
Authority
CN
China
Prior art keywords
mapreduce
stage
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410836410.XA
Other languages
Chinese (zh)
Other versions
CN104503844A (en)
Inventor
贝振东
喻之斌
须成忠
曾经纬
田盼
张慧玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201410836410.XA priority Critical patent/CN104503844B/en
Publication of CN104503844A publication Critical patent/CN104503844A/en
Application granted granted Critical
Publication of CN104503844B publication Critical patent/CN104503844B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Abstract

In the MapReduce job fine-grained classification method based on multi-stage features provided by the invention, the features of each stage are classified separately, so that the jobs that are similar to one another in each stage can be identified. Jobs that are similar in a given stage can be optimized together for that stage, enabling rapid optimization at the level of individual MapReduce stages. This makes the goal of optimization more explicit and improves optimization efficiency. The classification results also facilitate fine-grained bottleneck analysis of MapReduce workflows: by finding the bottlenecks that restrict a program's runtime performance, the design of the program can be improved in a more targeted way, improving the performance of the program itself.

Description

MapReduce job fine-grained classification method based on multi-stage features
Technical field
The present invention relates to the technical field of data processing, and more particularly to a MapReduce job fine-grained classification method based on multi-stage features.
Background technology
MapReduce is a programming model for distributed data processing. A MapReduce job is broadly divided into two phases: a Map phase and a Reduce phase. The Map phase is executed first, followed by the Reduce phase. The Map phase can be further divided into five sub-stages: read, map, collect, spill and merge; the Reduce phase can be further divided into four sub-stages: shuffle, sort, reduce and write.
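As a sketch, the phase/sub-stage breakdown described above can be captured in a small lookup structure (the names MAP_SUBSTAGES and REDUCE_SUBSTAGES are ours, not the patent's; the stage names are taken directly from the text):

```python
# The two MapReduce phases and their nine sub-stages, in execution order.
MAP_SUBSTAGES = ["read", "map", "collect", "spill", "merge"]
REDUCE_SUBSTAGES = ["shuffle", "sort", "reduce", "write"]

def all_stages():
    """Return the 9 sub-stages in execution order (Map phase first)."""
    return MAP_SUBSTAGES + REDUCE_SUBSTAGES
```

These nine sub-stages are the "stages" that the classification method below clusters separately.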
Performance data during MapReduce execution can be obtained through the distributed monitoring system Ganglia. Ganglia is an open-source cluster monitoring project initiated at UC Berkeley and designed to measure thousands of nodes. Its core consists of gmond, gmetad and a web front end. It is primarily used to monitor system performance metrics such as CPU, memory and disk utilization, I/O load and network traffic; the working state of each node is easy to read from the curves, which plays an important role in reasonably adjusting and allocating system resources and improving overall system performance. Every machine runs a metric-collecting and metric-sending daemon named gmond. The host that receives all metric data can display it and pass a condensed form of it up a hierarchy. This hierarchical structure is what allows Ganglia to scale well. The system load imposed by gmond is very small, which makes it suitable to run on every machine in the cluster without affecting user performance. Repeatedly collecting all of this data can, however, affect node performance.
Network "jitter" occurs when a large number of small messages appear at the same time; it can be avoided by keeping the node clocks consistent. gmetad can be deployed on any node in the cluster, or on a dedicated host connected to the cluster over the network. It communicates with gmond by unicast routing, collects the status information of the nodes in its zone, and stores it in a database in XML form. For the classification of MapReduce jobs, current approaches mainly analyze overall performance data from a MapReduce run, without considering the performance characteristics of each stage in any detail. In practice, however, the performance of the different stages during a job run differs, so different job types have performance bottlenecks in different stages, and different analysis and tuning methods must be applied depending on the bottleneck stage. It is therefore necessary to study and formulate a fine-grained MapReduce job classification method to achieve precise classification of jobs.
Summary of the invention
In view of this, it is necessary to provide a method for fine-grained classification of MapReduce jobs.
To achieve the above object, the present invention adopts the following technical solution:
A MapReduce job fine-grained classification method based on multi-stage features comprises the following steps:
Step S110: collect the performance data of each node in the Hadoop cluster, the performance data including CPU utilization, memory utilization and I/O utilization, where:
the set of CPU utilizations is Cjm = {C1jm, C2jm, ..., Cnjm};
the set of memory utilizations is Mjm = {M1jm, M2jm, ..., Mnjm};
the set of I/O utilizations is Ijm = {I1jm, I2jm, ..., Injm};
wherein m is the number of jobs to be classified on the Hadoop cluster, the set of m jobs is denoted Job = {j1, j2, ..., jm}, and n is the number of nodes in the Hadoop cluster;
Step S120: compute the mean of each job's performance data over the nodes, denoted respectively:
CMeanjm = (C1jm + C2jm + ... + Cnjm)/n;
MMeanjm = (M1jm + M2jm + ... + Mnjm)/n;
IMeanjm = (I1jm + I2jm + ... + Injm)/n;
Step S130: divide the mean performance data of each job into 9 stages, denoted respectively:
CMeanjm = {CM1jm, CM2jm, ..., CM9jm};
MMeanjm = {MM1jm, MM2jm, ..., MM9jm};
IMeanjm = {IM1jm, IM2jm, ..., IM9jm};
Step S140: perform hierarchical clustering on each stage separately.
Preferably, in step S140, performing hierarchical clustering on each stage of the MapReduce run separately comprises the following steps:
Step S141: describe each of the m jobs by a feature vector Fijm, where Fijm = {CMijm, MMijm, IMijm};
Step S142: find the two closest classes and merge them into one;
Step S143: recalculate the Euclidean distance between the new class and the old classes;
Step S144: repeat steps S142 and S143 until everything is finally merged into one class.
Preferably, in step S142, finding the two closest classes and merging them into one is specifically: for the m feature vectors of the i-th stage, calculate the pairwise Euclidean distances of the m feature vectors, obtaining the distances of m × (m − 1) combinations; find the combination with the smallest distance and merge its two classes into a new class.
Preferably, the Euclidean distance is calculated as:
DisFi = sqrt((CMijo - CMijp)^2 + (MMijo - MMijp)^2 + (IMijo - IMijp)^2)
where the feature vectors of the two jobs jo and jp in the i-th stage are:
Fijo = {CMijo, MMijo, IMijo}, Fijp = {CMijp, MMijp, IMijp}.
In the MapReduce job fine-grained classification method based on multi-stage features provided by the invention, the features of each stage are classified separately, so that the jobs that are similar to one another in each stage can be identified. Jobs that are similar in a given stage can be optimized together for that stage, enabling rapid optimization at the level of individual MapReduce stages. This makes the goal of optimization more explicit and improves optimization efficiency. The classification results also facilitate fine-grained bottleneck analysis of MapReduce workflows: by finding the bottlenecks that restrict a program's runtime performance, the design of the program can be improved in a more targeted way, improving the performance of the program itself.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the MapReduce job fine-grained classification method based on multi-stage features provided by the invention.
Fig. 2 is a flow chart of the steps of performing hierarchical clustering on each stage separately.
Embodiment
In order to make the purpose, technical solution and beneficial effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
Referring to Fig. 1, which is a flow chart of the steps of the MapReduce job fine-grained classification method based on multi-stage features provided by the invention, the method comprises the following steps:
Step S110: collect the performance data of each node in the Hadoop cluster, the performance data including CPU utilization, memory utilization and I/O utilization, where:
the set of CPU utilizations is Cjm = {C1jm, C2jm, ..., Cnjm};
the set of memory utilizations is Mjm = {M1jm, M2jm, ..., Mnjm};
the set of I/O utilizations is Ijm = {I1jm, I2jm, ..., Injm};
wherein m is the number of jobs to be classified on the Hadoop cluster, the set of m jobs is denoted Job = {j1, j2, ..., jm}, and n is the number of nodes in the Hadoop cluster;
In the present embodiment, with Ganglia installed on a Hadoop cluster of n nodes, the set of m jobs to be classified is Job = {j1, j2, ..., jm}, and the performance data of each job while running on the n nodes is collected from Ganglia's database module.
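A minimal sketch of step S110, assuming the Ganglia samples have already been fetched. Here collect_metric is a hypothetical callback standing in for a query against gmetad's database, since the patent does not specify a retrieval API:

```python
def build_metric_sets(jobs, nodes, collect_metric):
    """Build the sets C_jm, M_jm and I_jm of step S110: for each job,
    one CPU / memory / I/O utilization value per node."""
    return {
        job: {
            metric: [collect_metric(job, node, metric) for node in nodes]
            for metric in ("cpu", "mem", "io")
        }
        for job in jobs
    }
```

For example, with a stub callback that returns 0.5 for every (job, node, metric) triple, each of the three sets for a job contains n entries, one per node.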
Step S120: compute the mean of each job's performance data over the nodes, denoted respectively:
CMeanjm = (C1jm + C2jm + ... + Cnjm)/n;
MMeanjm = (M1jm + M2jm + ... + Mnjm)/n;
IMeanjm = (I1jm + I2jm + ... + Injm)/n;
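Step S120 is a plain arithmetic mean over the n nodes; a sketch (the dict layout follows the build_metric_sets assumption above, but the function only depends on metric-name-to-values mappings):

```python
def metric_means(job_sets):
    """Step S120: average each metric set over the n nodes, producing
    CMean_jm, MMean_jm and IMean_jm for one job."""
    return {metric: sum(values) / len(values)
            for metric, values in job_sets.items()}
```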
Step S130: divide the mean performance data of each job into 9 stages, denoted respectively:
CMeanjm = {CM1jm, CM2jm, ..., CM9jm};
MMeanjm = {MM1jm, MM2jm, ..., MM9jm};
IMeanjm = {IM1jm, IM2jm, ..., IM9jm};
It will be appreciated that the start and end times of each stage of a MapReduce run can be determined from the Hadoop logs; according to these start and end times, the mean values of the three performance-data sets of each job can be divided into 9 segments.
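Step S130 can be sketched as windowing a job's time-stamped samples by the nine (start, end) intervals recovered from the Hadoop logs; the tuple representation of samples and boundaries is our assumption, as the patent does not fix a data layout:

```python
def stage_means(samples, boundaries):
    """Step S130 for one metric: samples is [(timestamp, value), ...] and
    boundaries is the list of 9 (start, end) stage intervals taken from
    the Hadoop logs. Returns [CM1, ..., CM9] for that metric
    (0.0 for a stage window containing no samples)."""
    means = []
    for start, end in boundaries:
        window = [v for t, v in samples if start <= t < end]
        means.append(sum(window) / len(window) if window else 0.0)
    return means
```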
Step S140: perform hierarchical clustering on each stage of the MapReduce run separately.
It will be appreciated that since hierarchical clustering is performed on each stage separately, 9 clustering runs are needed. In this way, the jobs that are similar in each stage can be obtained; these similar jobs can be optimized together for that stage, enabling rapid optimization at the level of individual MapReduce stages.
Referring to Fig. 2, performing hierarchical clustering on each stage separately comprises the following steps:
Step S141: describe each of the m jobs by a feature vector Fijm, where Fijm = {CMijm, MMijm, IMijm};
Step S142: find the two closest classes and merge them into one;
Specifically, for the m feature vectors of the i-th stage, calculate the pairwise Euclidean distances of the m feature vectors, obtaining the distances of m × (m − 1) combinations; find the combination with the smallest distance and merge its two classes into a new class.
Step S143: recalculate the Euclidean distance between the new class and the old classes;
Step S144: repeat steps S142 and S143 until everything is finally merged into one class.
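Steps S141 to S144 describe standard agglomerative (bottom-up) hierarchical clustering. A sketch in pure Python; representing a merged class by its weighted centroid is our assumption, since the text only says the distance to the new class is recalculated without fixing a linkage rule:

```python
import math

def euclid(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(vectors):
    """Steps S141-S144: start with one class per job feature vector F_ijm
    and repeatedly merge the two closest classes (S142), recalculating
    distances via the merged class's centroid (S143, an assumption),
    until one class remains (S144). Returns the merge history as
    (members_a, members_b) pairs of original job indices."""
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]
    history = []
    while len(clusters) > 1:
        # S142: scan all pairs of current classes for the smallest distance
        a, b = min(
            ((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
            key=lambda p: euclid(clusters[p[0]][1], clusters[p[1]][1]),
        )
        ids_a, cen_a = clusters[a]
        ids_b, cen_b = clusters[b]
        merged = ids_a + ids_b
        # S143: summarise the new class by its size-weighted centroid
        centroid = [(u * len(ids_a) + v * len(ids_b)) / len(merged)
                    for u, v in zip(cen_a, cen_b)]
        history.append((ids_a, ids_b))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append((merged, centroid))
    return history
```

Running this on the 3-dimensional stage-i feature vectors of all m jobs gives, for that stage, a merge tree that can be cut at any level to obtain groups of similar jobs.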
Wherein the Euclidean distance is calculated as:
DisFi = sqrt((CMijo - CMijp)^2 + (MMijo - MMijp)^2 + (IMijo - IMijp)^2)
where the feature vectors of the two jobs jo and jp in the i-th stage are:
Fijo = {CMijo, MMijo, IMijo}, Fijp = {CMijp, MMijp, IMijp}.
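The distance above is the ordinary three-dimensional Euclidean distance between two jobs' stage-i feature vectors; a sketch (the function name dis_fi is ours):

```python
import math

def dis_fi(f_jo, f_jp):
    """DisFi for stage i: Euclidean distance between two jobs' feature
    vectors (CMi, MMi, IMi), matching the formula above."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_jo, f_jp)))
```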
In the MapReduce job fine-grained classification method based on multi-stage features provided by the invention, the features of each stage are classified separately, so that the jobs that are similar to one another in each stage can be identified. Jobs that are similar in a given stage can be optimized together for that stage, enabling rapid optimization at the level of individual MapReduce stages. This makes the goal of optimization more explicit and improves optimization efficiency. The classification results also facilitate fine-grained bottleneck analysis of MapReduce workflows: by finding the bottlenecks that restrict a program's runtime performance, the design of the program can be improved in a more targeted way, improving the performance of the program itself.
It should be noted that in the above embodiments each embodiment is described with its own emphasis; for parts not described in detail in one embodiment, reference may be made to the detailed description elsewhere in the specification, which is not repeated here.
The above is only the preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications without departing from the principles of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (4)

  1. A MapReduce job fine-grained classification method based on multi-stage features, characterized by comprising the following steps:
    Step S110: collect the performance data of each node in the Hadoop cluster, the performance data including CPU utilization, memory utilization and I/O utilization, where:
    the set of CPU utilizations is Cjm = {C1jm, C2jm, ..., Cnjm};
    the set of memory utilizations is Mjm = {M1jm, M2jm, ..., Mnjm};
    the set of I/O utilizations is Ijm = {I1jm, I2jm, ..., Injm};
    wherein m is the number of jobs to be classified on the Hadoop cluster, the set of m jobs is denoted Job = {j1, j2, ..., jm}, and n is the number of nodes in the Hadoop cluster;
    Step S120: compute the mean of each job's performance data over the nodes, denoted respectively:
    CMeanjm = (C1jm + C2jm + ... + Cnjm)/n;
    MMeanjm = (M1jm + M2jm + ... + Mnjm)/n;
    IMeanjm = (I1jm + I2jm + ... + Injm)/n;
    Step S130: divide the mean performance data of each job into 9 stages; CMeanjm is denoted {CM1jm, CM2jm, ..., CM9jm}; MMeanjm is denoted {MM1jm, MM2jm, ..., MM9jm}; IMeanjm is denoted {IM1jm, IM2jm, ..., IM9jm};
    Step S140: perform hierarchical clustering on each stage of the MapReduce run separately.
  2. The MapReduce job fine-grained classification method based on multi-stage features according to claim 1, characterized in that in step S140, performing hierarchical clustering on each stage separately comprises the following steps:
    Step S141: describe each of the m jobs by a feature vector Fijm, where Fijm = {CMijm, MMijm, IMijm}, and i denotes the i-th stage;
    Step S142: find the two closest classes and merge them into one;
    Step S143: recalculate the Euclidean distance between the new class and the old classes;
    Step S144: repeat steps S142 and S143 until everything is finally merged into one class.
  3. The MapReduce job fine-grained classification method based on multi-stage features according to claim 2, characterized in that in step S142, finding the two closest classes and merging them into one is specifically: for the m feature vectors of the i-th stage, calculate the pairwise Euclidean distances of the m feature vectors, obtaining the distances of m × (m − 1) combinations; find the combination with the smallest distance and merge its two classes into a new class.
  4. The MapReduce job fine-grained classification method based on multi-stage features according to claim 2 or 3, characterized in that the Euclidean distance is calculated as:
    DisFi = sqrt((CMijo - CMijp)^2 + (MMijo - MMijp)^2 + (IMijo - IMijp)^2)
    where the feature vectors of the two jobs in the i-th stage are:
    Fijo = {CMijo, MMijo, IMijo}, Fijp = {CMijp, MMijp, IMijp}.
CN201410836410.XA 2014-12-29 2014-12-29 MapReduce job fine-grained classification method based on multi-stage features Active CN104503844B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410836410.XA CN104503844B (en) 2014-12-29 2014-12-29 MapReduce job fine-grained classification method based on multi-stage features


Publications (2)

Publication Number Publication Date
CN104503844A CN104503844A (en) 2015-04-08
CN104503844B true CN104503844B (en) 2018-03-09

Family

ID=52945244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410836410.XA Active CN104503844B (en) 2014-12-29 MapReduce job fine-grained classification method based on multi-stage features

Country Status (1)

Country Link
CN (1) CN104503844B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915260B (en) * 2015-06-19 2018-05-25 北京搜狐新媒体信息技术有限公司 A kind of distribution method and system of Hadoop cluster managements task
CN106126407B (en) * 2016-06-22 2018-07-17 西安交通大学 A kind of performance monitoring Operation Optimization Systerm and method for distributed memory system
CN106341478A (en) * 2016-09-13 2017-01-18 广州中大数字家庭工程技术研究中心有限公司 Education resource sharing system based on Hadoop and realization method
CN110543588A (en) * 2019-08-27 2019-12-06 中国科学院软件研究所 Distributed clustering method and system for large-scale stream data
CN110704515B (en) * 2019-12-11 2020-06-02 四川新网银行股份有限公司 Two-stage online sampling method based on MapReduce model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004670A (en) * 2009-12-17 2011-04-06 华中科技大学 Self-adaptive job scheduling method based on MapReduce
CN103605576A (en) * 2013-11-25 2014-02-26 华中科技大学 Multithreading-based MapReduce execution system
CN103631657A (en) * 2013-11-19 2014-03-12 浪潮电子信息产业股份有限公司 Task scheduling algorithm based on MapReduce
CN103701635A (en) * 2013-12-10 2014-04-02 Shenzhen Institute of Advanced Technology of CAS Method and device for configuring Hadoop parameters online

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100082599A1 (en) * 2008-09-30 2010-04-01 Goetz Graefe Characterizing Queries To Predict Execution In A Database
US9367601B2 (en) * 2012-03-26 2016-06-14 Duke University Cost-based optimization of configuration parameters and cluster sizing for hadoop


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
K-means clustering ensemble based on MapReduce; Ji Suqin, Shi Hongbo; Computer Engineering; 2013-09-30; full text *
Distributed affinity propagation clustering algorithm based on MapReduce; Lu Weiming, Du Chenyang et al.; Journal of Computer Research and Development; 2012-12-31; full text *

Also Published As

Publication number Publication date
CN104503844A (en) 2015-04-08


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant