CN104503844B - Fine-grained classification method for MapReduce jobs based on multi-stage features - Google Patents
Fine-grained classification method for MapReduce jobs based on multi-stage features
- Publication number
- CN104503844B CN104503844B CN201410836410.XA CN201410836410A CN104503844B CN 104503844 B CN104503844 B CN 104503844B CN 201410836410 A CN201410836410 A CN 201410836410A CN 104503844 B CN104503844 B CN 104503844B
- Authority
- CN
- China
- Prior art keywords
- mapreduce
- stage
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
In the fine-grained classification method for MapReduce jobs based on multi-stage features provided by this invention, the features of each stage are classified separately, so that jobs that are similar in each individual stage can be identified. These similar jobs can then be optimized together for that stage, which enables rapid optimization at the level of individual MapReduce stages, makes the goal of optimization more explicit, and improves the efficiency of optimization. The classification results also support fine-grained bottleneck analysis of MapReduce workflows: by identifying the bottlenecks that constrain runtime performance, the design of a program can be improved in a more targeted way, raising the performance of the program itself.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a fine-grained classification method for MapReduce jobs based on multi-stage features.
Background technology
MapReduce is a programming model for distributed data processing. MapReduce processing is broadly divided into two phases: a Map phase and a Reduce phase. The Map phase is executed first, followed by the Reduce phase. The Map phase can be further divided into five sub-stages (read, map, collect, spill and merge), and the Reduce phase can be further divided into four sub-stages (shuffle, sort, reduce and write).
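The phase decomposition above can be written down as a small sketch; the constant names are illustrative only, not Hadoop API identifiers:

```python
# The two MapReduce phases and their nine sub-stages as described above.
# These names follow the patent's description, not any Hadoop API.
MAP_SUBSTAGES = ["read", "map", "collect", "spill", "merge"]
REDUCE_SUBSTAGES = ["shuffle", "sort", "reduce", "write"]

# The 9 stages over which per-job performance features are later computed.
ALL_STAGES = MAP_SUBSTAGES + REDUCE_SUBSTAGES
```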
Performance data during a MapReduce run can be obtained with the distributed monitoring system Ganglia. Ganglia is an open-source cluster monitoring project initiated at UC Berkeley and designed to scale to thousands of nodes. Its core consists of gmond, gmetad and a web front end. It is mainly used to monitor system performance metrics such as CPU, memory and disk utilization, I/O load and network traffic; the plotted curves make it easy to see the working state of each node, which plays an important role in adjusting and allocating system resources sensibly and improving overall system performance. Every machine runs a daemon named gmond that collects and sends metric data. The host that receives all metric data can display them and can pass a condensed list of them up a hierarchy; it is this hierarchical pattern that allows Ganglia to scale well. The system load imposed by gmond is very small, so it can run on every machine in the cluster without affecting user performance. Collecting all these data repeatedly can, however, affect node performance: network "jitter" occurs when a large number of small messages arrive simultaneously, and it can be avoided by keeping the node clocks synchronized. gmetad can be deployed on any node in the cluster, or on a dedicated host connected to the cluster over the network; it communicates with gmond by unicast, collects the status information of the nodes in its zone, and stores it in a database in XML form.
For the problem of classifying MapReduce jobs, current work mainly analyzes the overall performance data of a MapReduce run, without considering the performance characteristics of each stage at a finer granularity. In practice, however, the performance of the different stages of a run differs, so different job types have performance bottlenecks in different stages, and different analysis and tuning methods need to be applied depending on the stage in which the bottleneck lies. It is therefore necessary to study and design a fine-grained MapReduce job classification method that classifies jobs precisely.
Summary of the invention
In view of the above, it is necessary to provide a fine-grained classification method for MapReduce jobs.
To achieve this object, the present invention adopts the following technical solution:
A fine-grained classification method for MapReduce jobs based on multi-stage features comprises the following steps:
Step S110: collect the performance data of each node in the Hadoop cluster; the performance data include CPU usage, memory usage and I/O usage, where
the set of CPU usage values is Cjm = {C1jm, C2jm, ..., Cnjm};
the set of memory usage values is Mjm = {M1jm, M2jm, ..., Mnjm};
the set of I/O usage values is Ijm = {I1jm, I2jm, ..., Injm};
and m is the number of jobs to be classified in the Hadoop cluster, the set of m jobs is denoted Job = {J1, J2, ..., Jm}, and n is the number of nodes in the Hadoop cluster.
Step S120: compute the mean of each job's performance data over the n nodes, denoted respectively:
CMeanjm=(C1jm+C2jm+...+Cnjm)/n;
MMeanjm=(M1jm+M2jm+...+Mnjm)/n;
IMeanjm=(I1jm+I2jm+...+Injm)/n;
Step S130: divide the mean performance data of each job into 9 stages, denoted respectively:
CMeanjm = {CM1jm, CM2jm, ..., CM9jm};
MMeanjm = {MM1jm, MM2jm, ..., MM9jm};
IMeanjm = {IM1jm, IM2jm, ..., IM9jm};
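Steps S110 through S120 amount to averaging each metric over the n nodes of the cluster. A minimal sketch (the function names are illustrative, not part of the invention):

```python
def job_means(samples):
    """Average a per-node metric over the n cluster nodes (step S120).

    samples: the n per-node utilization values for one job and one metric,
    e.g. [C1, C2, ..., Cn] for CPU usage.
    """
    return sum(samples) / len(samples)


def job_feature_sets(cpu, mem, io):
    """Steps S110-S120: per-job mean CPU, memory and I/O utilization."""
    return {"CMean": job_means(cpu),
            "MMean": job_means(mem),
            "IMean": job_means(io)}
```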
Step S140: perform hierarchical clustering on each stage separately.
Preferably, in step S140, performing hierarchical clustering on each stage of the MapReduce run separately comprises the following steps:
Step S141: describe each of the m job classes by a feature vector Fijm, where Fijm = {CMijm, MMijm, IMijm};
Step S142: find the two closest classes and merge them into one;
Step S143: recompute the Euclidean distances between the new class and the old classes;
Step S144: repeat steps S142 and S143 until everything has been merged into a single class.
Preferably, in step S142, finding the two closest classes and merging them into one specifically comprises: for the m feature vectors of the i-th stage, compute the Euclidean distance between every pair of the m feature vectors, obtaining m × (m - 1) pairwise distances; find the pair with the smallest distance, and merge the two classes of that pair into a new class.
Preferably, the Euclidean distance is computed as:
DisFi = sqrt((CMijo - CMijp)^2 + (MMijo - MMijp)^2 + (IMijo - IMijp)^2)
where the feature vectors of two jobs in the i-th stage are:
Fijo = {CMijo, MMijo, IMijo}, Fijp = {CMijp, MMijp, IMijp}.
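Steps S141 through S144 describe standard agglomerative clustering over the stage-i feature vectors. The sketch below uses the Euclidean distance from the claims; one detail the text leaves open is how the merged class's feature vector is recomputed, so an average-linkage-style centroid is assumed here:

```python
import math


def dist(f, g):
    # Euclidean distance between two stage-i feature vectors
    # F = (CMi, MMi, IMi), per the formula in claim 4.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def agglomerate(vectors):
    """Steps S141-S144: repeatedly merge the closest pair of classes.

    Returns the merge history as (members_a, members_b) tuples. The merged
    class's vector is a size-weighted centroid, an assumption: the patent
    does not state how the new class's feature vector is recomputed.
    """
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):           # step S142: closest pair
            for b in range(a + 1, len(clusters)):
                d = dist(clusters[a][1], clusters[b][1])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        mem_a, vec_a = clusters[a]
        mem_b, vec_b = clusters.pop(b)
        merged = [(x * len(mem_a) + y * len(mem_b)) / (len(mem_a) + len(mem_b))
                  for x, y in zip(vec_a, vec_b)]  # step S143: new class vector
        clusters[a] = (mem_a + mem_b, merged)
        history.append((tuple(mem_a), tuple(mem_b)))
    return history                                # step S144: loop until one class
```

Running this once per stage (9 times in total) yields, for every stage, a dendrogram of jobs that behave similarly in that stage.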
In the fine-grained classification method for MapReduce jobs based on multi-stage features provided by this invention, the features of each stage are classified separately, so that jobs that are similar in each individual stage can be identified. These similar jobs can then be optimized together for that stage, which enables rapid optimization at the level of individual MapReduce stages, makes the goal of optimization more explicit, and improves the efficiency of optimization. The classification results also support fine-grained bottleneck analysis of MapReduce workflows: by identifying the bottlenecks that constrain runtime performance, the design of a program can be improved in a more targeted way, raising the performance of the program itself.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the fine-grained classification method for MapReduce jobs based on multi-stage features provided by the invention.
Fig. 2 is a flow chart of the steps of performing hierarchical clustering on each stage separately.
Detailed description of the embodiments
To make the purpose, technical solution and beneficial effects of the present invention clearer, the invention is described in further detail below with reference to the drawings and an embodiment. It should be understood that the specific embodiment described here only explains the invention and is not intended to limit it.
Referring to Fig. 1, which is a flow chart of the steps of the fine-grained classification method for MapReduce jobs based on multi-stage features provided by the invention, the method comprises the following steps:
Step S110: collect the performance data of each node in the Hadoop cluster; the performance data include CPU usage, memory usage and I/O usage, where
the set of CPU usage values is Cjm = {C1jm, C2jm, ..., Cnjm};
the set of memory usage values is Mjm = {M1jm, M2jm, ..., Mnjm};
the set of I/O usage values is Ijm = {I1jm, I2jm, ..., Injm};
and m is the number of jobs to be classified in the Hadoop cluster, the set of m jobs is denoted Job = {J1, J2, ..., Jm}, and n is the number of nodes in the Hadoop cluster.
In this embodiment, a Hadoop cluster of n nodes with Ganglia installed runs a set Job = {J1, J2, ..., Jm} of m jobs to be classified, and the performance data of each job while running on the n nodes are collected from Ganglia's database module.
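As a hedged illustration of what collecting these data from Ganglia can look like: gmond exposes cluster state as an XML dump with CLUSTER, HOST and METRIC elements carrying NAME and VAL attributes; the concrete metric names (e.g. `cpu_idle`) depend on the Ganglia configuration and are assumptions here:

```python
import xml.etree.ElementTree as ET


def parse_gmond_xml(xml_text, metric_name):
    """Extract one metric's value per host from a gmond XML dump.

    gmond's XML uses CLUSTER > HOST > METRIC elements with NAME/VAL
    attributes; the metric name queried (e.g. "cpu_idle") depends on
    the Ganglia configuration, so treat it as an assumption.
    """
    root = ET.fromstring(xml_text)
    values = {}
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                values[host.get("NAME")] = float(metric.get("VAL"))
    return values
```

The per-node values returned for a metric correspond to one job's set such as Cjm = {C1jm, ..., Cnjm} in step S110.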
Step S120: compute the mean of each job's performance data over the n nodes, denoted respectively:
CMeanjm=(C1jm+C2jm+...+Cnjm)/n;
MMeanjm=(M1jm+M2jm+...+Mnjm)/n;
IMeanjm=(I1jm+I2jm+...+Injm)/n;
Step S130: divide the mean performance data of each job into 9 stages, denoted respectively:
CMeanjm = {CM1jm, CM2jm, ..., CM9jm};
MMeanjm = {MM1jm, MM2jm, ..., MM9jm};
IMeanjm = {IM1jm, IM2jm, ..., IM9jm};
It will be appreciated that the start and stop times of each stage of a MapReduce run can be determined from the Hadoop logs; according to these start and stop times, the mean values of the three performance-data sets of each job can be divided into 9 segments.
Step S140: perform hierarchical clustering on each stage of the MapReduce run separately.
It will be appreciated that, since hierarchical clustering is performed for each stage separately, 9 clustering runs are needed. In this way, jobs that are similar in each individual stage can be identified; these similar jobs can be optimized together for that stage, which enables rapid optimization at the level of individual MapReduce stages.
Referring to Fig. 2, performing hierarchical clustering on each stage separately comprises the following steps:
Step S141: describe each of the m job classes by a feature vector Fijm, where Fijm = {CMijm, MMijm, IMijm}.
Step S142: find the two closest classes and merge them into one.
Specifically, for the m feature vectors of the i-th stage, compute the Euclidean distance between every pair of the m feature vectors, obtaining m × (m - 1) pairwise distances; find the pair with the smallest distance, and merge the two classes of that pair into a new class.
Step S143: recompute the Euclidean distances between the new class and the old classes.
Step S144: repeat steps S142 and S143 until everything has been merged into a single class.
The Euclidean distance is computed as:
DisFi = sqrt((CMijo - CMijp)^2 + (MMijo - MMijp)^2 + (IMijo - IMijp)^2)
where the feature vectors of two jobs in the i-th stage are:
Fijo = {CMijo, MMijo, IMijo}, Fijp = {CMijp, MMijp, IMijp}.
In the fine-grained classification method for MapReduce jobs based on multi-stage features provided by this invention, the features of each stage are classified separately, so that jobs that are similar in each individual stage can be identified. These similar jobs can then be optimized together for that stage, which enables rapid optimization at the level of individual MapReduce stages, makes the goal of optimization more explicit, and improves the efficiency of optimization. The classification results also support fine-grained bottleneck analysis of MapReduce workflows: by identifying the bottlenecks that constrain runtime performance, the design of a program can be improved in a more targeted way, raising the performance of the program itself.
It should be noted that each of the embodiments above emphasizes particular aspects; for parts of an embodiment that are not described in detail, refer to the detailed description elsewhere in the specification, which is not repeated here.
The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and modifications without departing from the principles of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the invention.
Claims (4)
- 1. A fine-grained classification method for MapReduce jobs based on multi-stage features, characterized by comprising the following steps:
Step S110: collect the performance data of each node in the Hadoop cluster, the performance data including CPU usage, memory usage and I/O usage, where
the set of CPU usage values is Cjm = {C1jm, C2jm, ..., Cnjm};
the set of memory usage values is Mjm = {M1jm, M2jm, ..., Mnjm};
the set of I/O usage values is Ijm = {I1jm, I2jm, ..., Injm};
and m is the number of jobs to be classified in the Hadoop cluster, the set of m jobs is denoted Job = {J1, J2, ..., Jm}, and n is the number of nodes in the Hadoop cluster;
Step S120: compute the mean of each job's performance data over the n nodes, denoted respectively:
CMeanjm = (C1jm + C2jm + ... + Cnjm)/n;
MMeanjm = (M1jm + M2jm + ... + Mnjm)/n;
IMeanjm = (I1jm + I2jm + ... + Injm)/n;
Step S130: divide the mean performance data of each job into 9 stages, with CMeanjm denoted {CM1jm, CM2jm, ..., CM9jm}, MMeanjm denoted {MM1jm, MM2jm, ..., MM9jm}, and IMeanjm denoted {IM1jm, IM2jm, ..., IM9jm};
Step S140: perform hierarchical clustering on each stage of the MapReduce run separately.
- 2. The fine-grained classification method for MapReduce jobs based on multi-stage features of claim 1, characterized in that, in step S140, performing hierarchical clustering on each stage comprises the following steps:
Step S141: describe each of the m job classes by a feature vector Fijm, where Fijm = {CMijm, MMijm, IMijm} and i denotes the i-th stage;
Step S142: find the two closest classes and merge them into one;
Step S143: recompute the Euclidean distances between the new class and the old classes;
Step S144: repeat steps S142 and S143 until everything has been merged into a single class.
- 3. The fine-grained classification method for MapReduce jobs based on multi-stage features of claim 2, characterized in that, in step S142, finding the two closest classes and merging them into one specifically comprises: for the m feature vectors of the i-th stage, compute the Euclidean distance between every pair of the m feature vectors, obtaining m × (m - 1) pairwise distances; find the pair with the smallest distance, and merge the two classes of that pair into a new class.
- 4. The fine-grained classification method for MapReduce jobs based on multi-stage features of claim 2 or 3, characterized in that the Euclidean distance is computed as:
DisFi = sqrt((CMijo - CMijp)^2 + (MMijo - MMijp)^2 + (IMijo - IMijp)^2)
where the feature vectors of two jobs in the i-th stage are:
Fijo = {CMijo, MMijo, IMijo}, Fijp = {CMijp, MMijp, IMijp}.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410836410.XA CN104503844B (en) | 2014-12-29 | 2014-12-29 | A kind of MapReduce operation fine grit classification methods based on multistage feature |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104503844A CN104503844A (en) | 2015-04-08 |
CN104503844B true CN104503844B (en) | 2018-03-09 |
Family
ID=52945244
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |