CN108647135A - Hadoop parameter automatic tuning method based on micro-operations - Google Patents
Hadoop parameter automatic tuning method based on micro-operations
- Publication number
- CN108647135A (application number CN201810426699.6A)
- Authority
- CN
- China
- Prior art keywords
- microoperation
- phase
- parameter
- model
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3447—Performance evaluation by modeling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3476—Data logging
Abstract
The invention belongs to the field of cloud computing technology and specifically relates to a Hadoop parameter automatic tuning method based on micro-operations. The method first decouples a MapReduce task into micro-operations of different types in different phases. For each selected micro-operation it then builds a model relating the time of a single execution to the amount of data processed in that execution. Using these models, it reconstructs and recombines the execution flow of each phase to obtain the relationship between phase running time and system parameters, and finally searches the model for the parameter combination that minimizes running time. The method does not need to change when the job type or cluster configuration changes; searching for the optimal parameters takes little time, and the method is efficient and portable.
Description
Technical field
The invention belongs to the field of cloud computing technology and specifically relates to a Hadoop parameter automatic tuning method based on micro-operation reconstruction.
Background
Resource optimization on distributed platforms has long been a topic of interest to users, and optimizing job running time in particular has been a major research focus. With the spread of cloud services in recent years, shortening job running time helps tenants improve efficiency and reduce rental costs, and helps providers maximize resource utilization.
In recent years the Hadoop distributed computing platform has become mature and widely used in industry, while in academia the optimization of various aspects of the Hadoop platform remains an active research topic. As Hadoop versions have evolved, computational efficiency is no longer the issue people focus on: large production clusters incur high operation and maintenance costs, and unreasonable allocation of cloud resources makes a company's cost problem increasingly prominent. Cost optimization of cloud distributed computing frameworks during computation is therefore one of the problems that major IT companies urgently need to solve.
Some research results already exist on optimizing Hadoop job running time:
1) Shi Zhan, Feng Dan, Yu Ruili. A Hadoop parameter automatic tuning method and system based on machine learning. CN106202431A. 2016.
This method clusters the resource-consumption features of different job types and builds a separate performance model for each cluster. It automatically identifies the parameters that most affect each class of application and gives quantitative recommended values. It effectively addresses two limitations of existing rule-of-thumb methods: heavy dependence on user experience, and parameter suggestions that are only qualitative.
2) Zhao Gansen, Gao Xiaojie, Tang Hua. A parameter automatic tuning method for iterative MapReduce jobs. CN106326005A. 2017.
This method executes actual jobs, evaluates the execution results, determines a new parameter configuration in the parameter space, and then runs the job again, iterating until a termination condition is met. It can improve the efficiency of each iteration of an iterative MapReduce job, which is convenient for users and greatly reduces wasted time.
Patents from the last two years focus mainly on characterizing how parameter changes affect job time. Another equally important concern in automatic Hadoop parameter tuning is portability across platforms: being able to build a tuning model quickly for different clusters and different job types is of great practical significance.
Summary of the invention
In view of the current rise of cloud computing services and the practical importance of automatic Hadoop parameter tuning, the present invention proposes a Hadoop 2.0 parameter automatic tuning method based on micro-operation reconstruction.
The technical solution adopted by the present invention is:
A Hadoop parameter automatic tuning method based on micro-operations, used to optimize the parameter combination under which a MapReduce job executes, characterized by comprising the following steps:
S1. Build the micro-operation models:
S11. Select the micro-operations: decouple the MapReduce task and select as micro-operations the single memory write cm_micro_op and the single disk write cd_micro_op of the collection phase in the Map task, and the single memory write sm_micro_op, the single memory-spill disk write sd_micro_op, and the single file-merge disk write merge_micro_op of the shuffle phase in the Reduce task;
S12. For the micro-operations selected in step S11, determine the parameters that affect them and the value space of each parameter;
S13. Different parameter values result in different amounts of data processed by a single micro-operation. Take discrete values along each dimension of the parameter space and execute actual jobs as benchmark tests for the micro-operation models, measuring the rate of a single micro-operation when it processes different amounts of data;
S14. Collect the execution logs of each phase of the benchmark test cases under the different parameter settings and, for the single disk write and single memory write operations of each phase, build a model relating single-execution time to the amount of data processed in one execution:
$T_{micro\_op} = \alpha \cdot D_{micro\_op} + \beta$
where $T_{micro\_op}$ is the execution time of the micro-operation, $D_{micro\_op}$ is the amount of data the micro-operation processes in one execution, and $\alpha$ and $\beta$ are model parameters;
S2. Reconstruct and combine the execution flow of the collection phase from the micro-operation models to obtain the relationship between the phase's running time and the system parameters:
S21. From the models of step S14, establish the relationship between micro-operation time and the system parameters that affect the operation;
S22. Reconstruct the collection phase from its micro-operations to obtain the actual execution flow of the collection phase, and hence the relationship between the collection phase parameters and the phase's execution time;
S3. Reconstruct and combine the execution flow of the shuffle phase from the micro-operation models to obtain the relationship between the phase's running time and the system parameters:
S31. From the models of step S14, establish the relationship between micro-operation time and the system parameters that affect the operation;
S32. Reconstruct the shuffle phase from its micro-operations to obtain the actual execution flow of the shuffle phase, and hence the relationship between the shuffle phase parameters and the phase's execution time;
S33. Model the execution time of the sort-write phase in the Reduce task separately: take discrete values in the parameter space determined by the number of memory-spill disk writes and the data volume, execute actual jobs, and measure how the phase's execution time under different parameters relates to the number of memory-spill disk writes in the shuffle phase and to the amount of data the phase processes. Then build the model relating sort-write phase execution time to the number of memory-spill disk writes and the total data processed by the phase:
$T_{sw\_phase} = D_{sw\_input} \cdot (N_{spill} \cdot \alpha_{sw\_phase} + \beta_{sw\_phase})$
where $T_{sw\_phase}$ is the running time of the sort-write phase, $D_{sw\_input}$ is the input data volume of a single reduce task, $N_{spill}$ is the number of memory-spill disk writes in the shuffle phase, and $\alpha_{sw\_phase}$ and $\beta_{sw\_phase}$ are model parameters;
S4. Find the parameter combination in the model that minimizes running time: use a search optimization algorithm to obtain the parameter combination with the shortest phase execution time in the model, searching each phase separately to obtain its own optimal parameter combination.
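As a worked illustration of the two models in S14 and S33, the arithmetic below plugs in coefficient values that are invented purely for the example (they are not measurements from the patent): a disk-write micro-operation with $\alpha = 0.05$ s/MB and $\beta = 0.2$ s handling 80 MB, and a sort-write phase with $\alpha_{sw\_phase} = 0.002$ s/MB, $\beta_{sw\_phase} = 0.03$ s/MB, 500 MB of reduce input and 4 spills.

```latex
\begin{align*}
T_{micro\_op} &= \alpha \cdot D_{micro\_op} + \beta
               = 0.05 \cdot 80 + 0.2 = 4.2\ \text{s} \\
T_{sw\_phase} &= D_{sw\_input} \cdot (N_{spill} \cdot \alpha_{sw\_phase} + \beta_{sw\_phase})
               = 500 \cdot (4 \cdot 0.002 + 0.03) = 19\ \text{s}
\end{align*}
```

Under these assumed coefficients, halving the number of spills to 2 would lower the sort-write estimate to $500 \cdot 0.034 = 17$ s, which is the kind of parameter sensitivity the model is meant to expose.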
The beneficial effects of the invention are:
(1) It proposes a fine-grained micro-operation model that accurately characterizes parameter changes. The model describes intuitively and accurately how changes in system parameters affect execution time. Because it takes a data-flow view, it also makes it convenient and accurate to analyze how job execution time changes when several parameters vary at once.
(2) It proposes a strategy for micro-operation reconstruction based on the job's execution logic. The method does not need to change when the job type or cluster configuration changes; searching for the optimal parameters takes little time, and the method is efficient and portable. It can be regarded as a description method and analytical framework for the optimization problem: it characterizes the effect of parameter changes at a finer granularity and builds a model in which to search for the optimal parameter combination.
Description of the drawings
Fig. 1 is a logic diagram of a MapReduce job in the present invention.
Detailed description
The technical solution of the present invention is described in detail below with reference to the accompanying drawing:
Step 1: For the fine-grained operations that parameters affect directly, build different models according to the type of operation. The core steps are as follows:
1) As shown in Fig. 1, decouple the MapReduce task to identify the micro-operations of each type in each phase: the collection phase single memory write cm_micro_op and the collection phase single disk write cd_micro_op; the shuffle phase single memory write sm_micro_op, the shuffle phase single spill disk write sd_micro_op, and the shuffle phase single merge disk write merge_micro_op.
2) Determine, for the selected micro-operations, the parameters that affect them and their value spaces. The parameters affecting cm_micro_op and cd_micro_op are io.sort.mb and sort.spill.percent, and the value space is each parameter's value range. The parameters affecting sm_micro_op, sd_micro_op and merge_micro_op are reduce.java.opts, shuffle.input.buffer.percent, shuffle.merge.percent and io.sort.factor, and the value space is each parameter's range of variation.
3) Different parameter values result in different amounts of data processed by a single micro-operation. Take discrete values along each dimension of the parameter space and execute actual jobs as benchmark tests for the micro-operation models, measuring the rate of a single micro-operation when it processes different amounts of data.
4) Collect the execution logs of each phase of the benchmark test cases under the different parameter settings and, for the single disk write and single memory write operations of each phase, build a model relating single-execution time to the amount of data processed in one execution.
$T_{micro\_op} = \alpha \cdot D_{micro\_op} + \beta$
The formula above is the linear model relating the single-execution time of a micro-operation to the amount of data it processes in one execution. $T_{micro\_op}$ is the execution time of the micro-operation, $D_{micro\_op}$ is the amount of data the micro-operation processes in one execution, and $\alpha$ and $\beta$ are model parameters.
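The benchmark sweep and model fit of Step 1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the text does not name a fitting method, so ordinary least squares is assumed here, and the candidate parameter values, the function names `parameter_grid` and `fit_micro_op_model`, and the sample measurements are all hypothetical.

```python
from itertools import product

# Parameter names come from the text; the candidate values are hypothetical.
COLLECTION_SPACE = {
    "io.sort.mb": [100, 200, 400],
    "sort.spill.percent": [0.6, 0.8],
}


def parameter_grid(space):
    """Yield one configuration dict per point of the discrete parameter grid."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def fit_micro_op_model(samples):
    """Fit T = alpha * D + beta by ordinary least squares (assumed method).

    samples: (D, T) pairs, where D is the data volume handled by one
    micro-operation and T is its measured execution time.
    """
    n = len(samples)
    mean_d = sum(d for d, _ in samples) / n
    mean_t = sum(t for _, t in samples) / n
    sxx = sum((d - mean_d) ** 2 for d, _ in samples)
    if sxx == 0:
        raise ValueError("need at least two distinct data volumes")
    sxy = sum((d - mean_d) * (t - mean_t) for d, t in samples)
    alpha = sxy / sxx
    beta = mean_t - alpha * mean_d
    return alpha, beta


# One benchmark job would be run per configuration in this list.
configs = list(parameter_grid(COLLECTION_SPACE))

# Made-up (data volume in MB, time in s) samples for a disk-write micro-operation,
# standing in for values parsed from execution logs.
cd_samples = [(60.0, 3.2), (80.0, 4.2), (120.0, 6.2), (160.0, 8.2)]
alpha, beta = fit_micro_op_model(cd_samples)
```

Each micro-operation type (cm_micro_op, cd_micro_op, and so on) would get its own fitted pair of coefficients from its own log samples.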
Step 2: Reconstruct and combine the execution flow of the collection phase from the micro-operation models to obtain the relationship between the phase's running time and the system parameters. The core steps are as follows:
1) Several parameters jointly determine the amount of data a single micro-operation processes. Using the relationship between execution time and data volume established in step 1 for the collection phase micro-operations, establish the relationship between micro-operation time and the system parameters that affect the operation.
2) Reconstruct the collection phase shown in Fig. 1 from its micro-operations to obtain the actual execution flow of the collection phase, and characterize the relationship between the collection phase parameters and the collection phase execution time.
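One way to picture the reconstruction in Step 2 is the sketch below. It rests on assumptions the text does not spell out: that a spill is triggered each time the sort buffer holds io.sort.mb × sort.spill.percent megabytes, and that every spill corresponds to exactly one memory-write micro-operation followed by one disk-write micro-operation on the same data. The function names and the coefficient values are hypothetical.

```python
def micro_op_time(model, data_mb):
    """Evaluate the linear model T = alpha * D + beta for one micro-operation."""
    alpha, beta = model
    return alpha * data_mb + beta


def collection_phase_time(map_output_mb, io_sort_mb, spill_percent,
                          cm_model, cd_model):
    """Estimate collection phase time by summing micro-operation times.

    Assumption: the buffer spills whenever it holds
    io_sort_mb * spill_percent MB, so the map output is processed in chunks
    of that size plus one smaller final chunk.
    """
    spill_mb = io_sort_mb * spill_percent
    if spill_mb <= 0:
        raise ValueError("spill threshold must be positive")
    full, rest = divmod(map_output_mb, spill_mb)
    chunks = [spill_mb] * int(full) + ([rest] if rest > 0 else [])
    # Each chunk costs one memory write (cm) and one disk write (cd).
    return sum(micro_op_time(cm_model, c) + micro_op_time(cd_model, c)
               for c in chunks)


# Hypothetical fitted coefficients (alpha in s/MB, beta in s).
cm_model = (0.01, 0.0)
cd_model = (0.05, 0.2)
phase_time = collection_phase_time(250.0, 200, 0.5, cm_model, cd_model)
```

Because the chunk size depends on both parameters, the phase time becomes a function of io.sort.mb and sort.spill.percent, which is the relationship Step 2 sets out to obtain.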
Step 3: Reconstruct and combine the execution flow of the shuffle phase from the micro-operation models to obtain the relationship between the phase's running time and the system parameters. The core steps are as follows:
1) Several parameters jointly determine the amount of data a single micro-operation processes. Using the relationship between execution time and data volume established in step 1 for the shuffle phase micro-operations, establish the relationship between micro-operation time and the system parameters that affect the operation.
2) Reconstruct the shuffle phase shown in Fig. 1 from its micro-operations to obtain the actual execution flow of the shuffle phase, and hence the relationship between the shuffle phase parameters and the shuffle phase execution time.
3) Model the execution time of the sort_write phase in the reduce task separately. Take discrete values in the parameter space determined by the number of spills and the data volume, execute actual jobs, and measure how the phase's execution time under different parameters relates to the number of spills in the shuffle phase and to the amount of data the phase processes. Then build the model relating sort_write phase execution time to the number of spills and the total data processed by the phase:
$T_{sw\_phase} = D_{sw\_input} \cdot (N_{spill} \cdot \alpha_{sw\_phase} + \beta_{sw\_phase})$
$T_{sw\_phase}$ is the running time of the sort_write phase, $D_{sw\_input}$ is the input data volume of a single reduce task, $N_{spill}$ is the number of spills in the shuffle phase, and $\alpha_{sw\_phase}$ and $\beta_{sw\_phase}$ are model parameters.
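Step 3 can be illustrated with the following sketch, which is an interpretation and not the patent's own procedure. It assumes that the reduce heap size is taken from reduce.java.opts, that an in-memory merge spills to disk once heap × shuffle.input.buffer.percent × shuffle.merge.percent megabytes have accumulated, and that on-disk files are merged io.sort.factor at a time, smallest first. The sort_write coefficients are fitted by least squares, again an assumed method. All function names and numbers are hypothetical.

```python
def micro_op_time(model, data_mb):
    """Evaluate the linear model T = alpha * D + beta for one micro-operation."""
    alpha, beta = model
    return alpha * data_mb + beta


def shuffle_phase(input_mb, heap_mb, buffer_percent, merge_percent, sort_factor,
                  sm_model, sd_model, merge_model):
    """Return (estimated shuffle time, number of spills) for one reduce task."""
    # Assumed spill threshold derived from the four shuffle parameters.
    threshold_mb = heap_mb * buffer_percent * merge_percent
    if threshold_mb <= 0 or sort_factor < 2:
        raise ValueError("invalid shuffle parameters")
    full, rest = divmod(input_mb, threshold_mb)
    files = [threshold_mb] * int(full) + ([rest] if rest > 0 else [])
    n_spill = len(files)
    # Each spill costs one memory write (sm) and one spill disk write (sd).
    total = sum(micro_op_time(sm_model, f) + micro_op_time(sd_model, f)
                for f in files)
    # Assumed merge policy: merge the smallest files, sort_factor at a time,
    # until no more than sort_factor files remain.
    while len(files) > sort_factor:
        files.sort()
        merged = sum(files[:sort_factor])
        total += micro_op_time(merge_model, merged)
        files = files[sort_factor:] + [merged]
    return total, n_spill


def sort_write_time(d_input_mb, n_spill, alpha_sw, beta_sw):
    """T_sw_phase = D_sw_input * (N_spill * alpha_sw_phase + beta_sw_phase)."""
    return d_input_mb * (n_spill * alpha_sw + beta_sw)


def fit_sort_write(observations):
    """Least-squares fit of (alpha_sw, beta_sw) from (D, N_spill, T) triples.

    The model is linear in the regressors x1 = D * N_spill and x2 = D with no
    intercept, so the 2x2 normal equations are solved directly.
    """
    s11 = s12 = s22 = s1t = s2t = 0.0
    for d, n, t in observations:
        x1, x2 = d * n, d
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        s1t += x1 * t
        s2t += x2 * t
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("observations do not determine both coefficients")
    return (s1t * s22 - s2t * s12) / det, (s11 * s2t - s12 * s1t) / det


# Hypothetical coefficients and measurements, for illustration only.
shuffle_time, n_spill = shuffle_phase(
    1000.0, 1024, 0.5, 0.5, 3, (0.01, 0.0), (0.05, 0.1), (0.02, 0.5))
alpha_sw, beta_sw = fit_sort_write(
    [(100.0, 1, 3.2), (100.0, 4, 3.8), (200.0, 2, 6.8)])
sw_time = sort_write_time(1000.0, n_spill, alpha_sw, beta_sw)
```

The spill count produced by the shuffle reconstruction feeds directly into the sort_write model, which is why the two are shown together.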
Step 4: Find the parameter combination in the model that minimizes running time. The core steps are as follows:
1) Steps 1 to 3 yield a description method and analytical framework for the optimization problem, which characterizes the relationship between the varying parameters and the analysis target at a finer granularity. The model can be used with different search algorithms.
2) On the basis of this model, any of a variety of search optimization algorithms can be used to obtain the parameter combination with the shortest phase execution time.
3) Search each phase separately to obtain its own optimal parameter combination. This completes parameter tuning and yields the optimal combination of all relevant parameters.
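The per-phase search of Step 4 might look like the sketch below. The text leaves the choice of search algorithm open, so exhaustive grid search is used here only because it is the simplest option; the two phase-time functions are toy stand-ins for the fitted models, and the candidate values and the 1000 MB workload are hypothetical.

```python
import math
from itertools import product


def grid_search(space, phase_time):
    """Evaluate phase_time on every grid point and return the best (config, time)."""
    names = list(space)
    best_cfg, best_time = None, float("inf")
    for values in product(*(space[name] for name in names)):
        cfg = dict(zip(names, values))
        t = phase_time(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time


def toy_collection_time(cfg):
    """Toy stand-in for the collection phase model: fewer spills cost less."""
    spill_mb = cfg["io.sort.mb"] * cfg["sort.spill.percent"]
    n_spills = math.ceil(1000 / spill_mb)
    return 0.06 * 1000 + 0.2 * n_spills


def toy_shuffle_time(cfg):
    """Toy stand-in for the shuffle phase model with an interior optimum."""
    factor = cfg["io.sort.factor"]
    return 50.0 / factor + 0.1 * factor


# Each phase has its own parameter space and model and is searched separately.
phases = {
    "collection": ({"io.sort.mb": [100, 200, 400],
                    "sort.spill.percent": [0.5, 0.8]}, toy_collection_time),
    "shuffle": ({"io.sort.factor": [10, 20, 50]}, toy_shuffle_time),
}

best_parameters = {}
for phase_name, (space, model) in phases.items():
    cfg, _ = grid_search(space, model)
    best_parameters.update(cfg)  # combine the per-phase optima
```

Searching each phase on its own keeps the grids small; a heuristic search could replace `grid_search` without changing the surrounding loop.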
Claims (1)
1. A Hadoop parameter automatic tuning method based on micro-operations, used to optimize the parameter combination under which a MapReduce job executes, characterized by comprising the following steps:
S1. Build the micro-operation models:
S11. Select the micro-operations: decouple the MapReduce task and select as micro-operations the single memory write cm_micro_op and the single disk write cd_micro_op of the collection phase in the Map task, and the single memory write sm_micro_op, the single memory-spill disk write sd_micro_op, and the single file-merge disk write merge_micro_op of the shuffle phase in the Reduce task;
S12. For the micro-operations selected in step S11, determine the parameters that affect them and the value space of each parameter;
S13. Different parameter values result in different amounts of data processed by a single micro-operation; take discrete values along each dimension of the parameter space and execute actual jobs as benchmark tests for the micro-operation models, measuring the rate of a single micro-operation when it processes different amounts of data;
S14. Collect the execution logs of each phase of the benchmark test cases under the different parameter settings and, for the single disk write and single memory write operations of each phase, build a model relating single-execution time to the amount of data processed in one execution:
$T_{micro\_op} = \alpha \cdot D_{micro\_op} + \beta$
where $T_{micro\_op}$ is the execution time of the micro-operation, $D_{micro\_op}$ is the amount of data the micro-operation processes in one execution, and $\alpha$ and $\beta$ are model parameters;
S2. Reconstruct and combine the execution flow of the collection phase from the micro-operation models to obtain the relationship between the phase's running time and the system parameters:
S21. From the models of step S14, establish the relationship between micro-operation time and the system parameters that affect the operation;
S22. Reconstruct the collection phase from its micro-operations to obtain the actual execution flow of the collection phase, and hence the relationship between the collection phase parameters and the phase's execution time;
S3. Reconstruct and combine the execution flow of the shuffle phase from the micro-operation models to obtain the relationship between the phase's running time and the system parameters:
S31. From the models of step S14, establish the relationship between micro-operation time and the system parameters that affect the operation;
S32. Reconstruct the shuffle phase from its micro-operations to obtain the actual execution flow of the shuffle phase, and hence the relationship between the shuffle phase parameters and the phase's execution time;
S33. Model the execution time of the sort-write phase in the Reduce task separately: take discrete values in the parameter space determined by the number of memory-spill disk writes and the data volume, execute actual jobs, and measure how the phase's execution time under different parameters relates to the number of memory-spill disk writes in the shuffle phase and to the amount of data the phase processes; then build the model relating sort-write phase execution time to the number of memory-spill disk writes and the total data processed by the phase:
$T_{sw\_phase} = D_{sw\_input} \cdot (N_{spill} \cdot \alpha_{sw\_phase} + \beta_{sw\_phase})$
where $T_{sw\_phase}$ is the running time of the sort-write phase, $D_{sw\_input}$ is the input data volume of a single reduce task, $N_{spill}$ is the number of memory-spill disk writes in the shuffle phase, and $\alpha_{sw\_phase}$ and $\beta_{sw\_phase}$ are model parameters;
S4. Find the parameter combination in the model that minimizes running time: use a search optimization algorithm to obtain the parameter combination with the shortest phase execution time in the model, searching each phase separately to obtain its own optimal parameter combination.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810426699.6A CN108647135B (en) | 2018-05-07 | 2018-05-07 | Hadoop parameter automatic tuning method based on micro-operation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810426699.6A CN108647135B (en) | 2018-05-07 | 2018-05-07 | Hadoop parameter automatic tuning method based on micro-operation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108647135A (en) | 2018-10-12 |
CN108647135B (en) | 2021-02-12 |
Family
ID=63749200
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810426699.6A Expired - Fee Related CN108647135B (en) | 2018-05-07 | 2018-05-07 | Hadoop parameter automatic tuning method based on micro-operation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108647135B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110427619A (en) * | 2019-07-23 | 2019-11-08 | 西南交通大学 | It is a kind of based on Multichannel fusion and the automatic proofreading for Chinese texts method that reorders |
CN111858003A (en) * | 2020-07-16 | 2020-10-30 | 山东大学 | Hadoop optimal parameter evaluation method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104361183A (en) * | 2014-11-21 | 2015-02-18 | 中国人民解放军国防科学技术大学 | Microprocessor micro system structure parameter optimizing method based on simulator |
CN106383746A (en) * | 2016-08-30 | 2017-02-08 | 北京航空航天大学 | Configuration parameter determination method and apparatus of big data processing system |
US9665404B2 (en) * | 2013-11-26 | 2017-05-30 | International Business Machines Corporation | Optimization of map-reduce shuffle performance through shuffler I/O pipeline actions and planning |
CN107612886A (en) * | 2017-08-15 | 2018-01-19 | 中国科学院大学 | A kind of Spark platforms Shuffle process compresses algorithm decision-making techniques |
Non-Patent Citations (2)
Title |
---|
SHIVNATH BABU: "Towards automatic optimization of MapReduce programs", SoCC '10: Proceedings of the 1st ACM Symposium on Cloud Computing * |
TONG Ying (童颖): "Hadoop parameter tuning method based on machine learning", China Master's Theses Full-text Database * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110427619A (en) * | 2019-07-23 | 2019-11-08 | 西南交通大学 | It is a kind of based on Multichannel fusion and the automatic proofreading for Chinese texts method that reorders |
CN110427619B (en) * | 2019-07-23 | 2022-06-21 | 西南交通大学 | Chinese text automatic proofreading method based on multi-channel fusion and reordering |
CN111858003A (en) * | 2020-07-16 | 2020-10-30 | 山东大学 | Hadoop optimal parameter evaluation method and device |
CN111858003B (en) * | 2020-07-16 | 2021-05-28 | 山东大学 | Hadoop optimal parameter evaluation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN108647135B (en) | 2021-02-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lama et al. | Aroma: Automated resource allocation and configuration of mapreduce environment in the cloud | |
Liao et al. | Gunther: Search-based auto-tuning of mapreduce | |
Herodotou et al. | No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics | |
Song et al. | A hadoop mapreduce performance prediction method | |
Garraghan et al. | An analysis of the server characteristics and resource utilization in google cloud | |
Alnafessah et al. | Quality-aware devops research: Where do we stand? | |
CN104750780B (en) | A kind of Hadoop configuration parameter optimization methods based on statistical analysis | |
Herodotou et al. | Mapreduce programming and cost-based optimization? crossing this chasm with starfish | |
CN103019855B (en) | Method for forecasting executive time of Map Reduce operation | |
CN105653647B (en) | The information collecting method and system of SQL statement | |
Fabra et al. | Reducing the price of resource provisioning using EC2 spot instances with prediction models | |
CN113157421B (en) | Distributed cluster resource scheduling method based on user operation flow | |
CN108647135A (en) | A kind of Hadoop parameter automated tuning methods based on microoperation | |
CN105630575A (en) | Performance evaluation method aiming at KVM virtualization server | |
Zhu et al. | Kea: Tuning an exabyte-scale data infrastructure | |
Pettijohn et al. | User-Centric Heterogeneity-Aware MapReduce Job Provisioning in the Public Cloud | |
Herodotou et al. | Automatic performance tuning for distributed data stream processing systems | |
Guo et al. | Automated exploration and implementation of distributed cnn inference at the edge | |
Bağbaba et al. | Improving the I/O performance of applications with predictive modeling based auto-tuning | |
Domaschka et al. | Hathi: an MCDM-based approach to capacity planning for cloud-hosted DBMS | |
Qi et al. | Data mining based root-cause analysis of performance bottleneck for big data workload | |
Zhang et al. | Getting more for less in optimized mapreduce workflows | |
Vallabhajosyula et al. | Establishing a Generalizable Framework for Generating Cost-Aware Training Data and Building Unique Context-Aware Walltime Prediction Regression Models | |
Sangroya et al. | Performance assurance model for hiveql on large data volume | |
Anselmo et al. | A data-centric approach for reducing carbon emissions in deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210212 |