CN108647135A - Automatic Hadoop parameter tuning method based on micro-operations - Google Patents

Automatic Hadoop parameter tuning method based on micro-operations

Info

Publication number
CN108647135A
Authority
CN
China
Prior art keywords
microoperation
phase
parameter
model
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810426699.6A
Other languages
Chinese (zh)
Other versions
CN108647135B (en)
Inventor
滕飞
李耘书
李天瑞
杜圣东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN201810426699.6A priority Critical patent/CN108647135B/en
Publication of CN108647135A publication Critical patent/CN108647135A/en
Application granted granted Critical
Publication of CN108647135B publication Critical patent/CN108647135B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3447Performance evaluation by modeling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466Performance evaluation by tracing or monitoring
    • G06F11/3476Data logging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Debugging And Monitoring (AREA)
  • Organic Low-Molecular-Weight Compounds And Preparation Thereof (AREA)

Abstract

The invention belongs to the field of cloud computing technology and relates in particular to an automatic Hadoop parameter tuning method based on micro-operations. The method first decouples MapReduce tasks into micro-operations of different types in different phases. For each selected micro-operation it builds a model relating the time of a single execution to the amount of data processed in that execution. It then reconstructs and combines the micro-operation models along the actual execution flow to obtain the relationship between each phase's running time and the system parameters, and finally searches the model for the parameter combination that minimizes running time. The method does not change with job type or cluster configuration; the search for optimal parameters is fast and efficient, and the method is highly portable.

Description

Automatic Hadoop parameter tuning method based on micro-operations
Technical Field
The invention belongs to the field of cloud computing technology and relates in particular to an automatic Hadoop parameter tuning method based on micro-operation reconstruction.
Background
Resource optimization on distributed platforms has long been a topic of interest to users, and reducing job running time on such platforms has been a particular research focus. With the spread of cloud services in recent years, shortening job running time helps tenants improve efficiency and lower rental costs, and helps providers maximize resource utilization.
In recent years the Hadoop distributed computing platform has matured and become widely used in industry, while in academia the optimization of various aspects of Hadoop remains an active research topic. As Hadoop versions have evolved, raw computing efficiency is no longer the main concern: large production clusters incur high operation and maintenance costs, and poor allocation of cloud resources makes company costs increasingly prominent. Cost optimization of cloud-based distributed computing frameworks is therefore one of the problems that major IT companies urgently need to solve.
Some research results already exist on optimizing Hadoop job running time:
1) Shi Zhan, Feng Dan, Yu Ruili. A machine-learning-based method and system for automatic Hadoop parameter tuning. CN106202431A, 2016.
This method clusters the resource-consumption characteristics of different job types and builds a separate performance model for each, automatically identifying the parameters that most affect each class of application and recommending quantitative parameter values. It addresses two weaknesses of existing rule-of-thumb methods: heavy dependence on user experience and the limits of purely qualitative parameter advice.
2) Zhao Gansen, Gao Xiaojie, Tang Hua. An automatic parameter tuning method for iterative MapReduce jobs. CN106326005A, 2017.
This method runs actual jobs and evaluates their execution to choose a new parameter configuration in the parameter space, then iterates until a termination condition is met. It improves the efficiency of each iteration of an iterative MapReduce job, is convenient for users, and greatly reduces wasted time.
Related patents from the past two years focus mainly on characterizing how parameter changes affect job time. Another concern in automatic Hadoop parameter tuning that deserves equal attention is portability across platforms: being able to build a tuning model quickly for different clusters and different job types has considerable practical value.
Summary of the Invention
In view of the rise of cloud computing services and the practical importance of automatic Hadoop parameter tuning, the invention proposes an automatic parameter tuning method for Hadoop 2.0 based on micro-operation reconstruction.
The technical solution adopted by the invention is:
An automatic Hadoop parameter tuning method based on micro-operations, used to optimize the parameter combination for executing MapReduce jobs, characterized by comprising the following steps:
S1. Build the micro-operation models:
S11. Select micro-operations: decouple the MapReduce task and select as micro-operations the single memory write cm_micro_op and the single disk write cd_micro_op of the collection phase in the Map task, and the single memory write sm_micro_op, the single memory-spill disk write sd_micro_op, and the single file-merge disk write merge_micro_op of the shuffle phase in the Reduce task;
S12. For the micro-operations selected in step S11, determine the parameters that affect them and the value space of each parameter;
S13. Different parameter values result in different amounts of data processed by a single micro-operation. Take discrete values along each dimension of the parameter space, run actual jobs as benchmark tests for the micro-operation models, and measure the rate of a single micro-operation when processing different amounts of data;
S14. Collect the execution logs of each phase of the benchmark test cases under different parameter settings, and for the single disk write and single memory write operations of each phase build a model relating single-execution time to the amount of data processed per execution:
$T_{micro\_op} = \alpha \cdot D_{micro\_op} + \beta$
where $T_{micro\_op}$ is the execution time of the micro-operation, $D_{micro\_op}$ is the amount of data it processes in one execution, and $\alpha$ and $\beta$ are model parameters;
S2. Reconstruct and combine the micro-operation models according to the execution flow of the collection phase to obtain the relationship between the phase's running time and the system parameters:
S21. Using the model from step S14, establish the relationship between micro-operation time and the system parameters that affect the operation;
S22. Reconstruct the collection phase from its micro-operations to obtain the actual execution flow of the collection phase, and from it the relationship between the collection-phase parameters and the phase's execution time;
S3. Reconstruct and combine the micro-operation models according to the execution flow of the shuffle phase to obtain the relationship between the phase's running time and the system parameters:
S31. Using the model from step S14, establish the relationship between micro-operation time and the system parameters that affect the operation;
S32. Reconstruct the shuffle phase from its micro-operations to obtain the actual execution flow of the shuffle phase, and from it the relationship between the shuffle-phase parameters and the phase's execution time;
S33. Model the execution time of the sort-write phase in the Reduce task separately: take discrete values in the parameter space defined by the number of memory-spill disk writes and the data volume, run actual jobs, measure how the phase's execution time under different parameters relates to the number of memory-spill disk writes in the shuffle phase and to the amount of data the phase processes, and build a model relating sort-write phase execution time to the number of memory-spill disk writes and the total amount of data processed by the phase:
$T_{sw\_phase} = D_{sw\_input} \cdot (N_{spill} \cdot \alpha_{sw\_phase} + \beta_{sw\_phase})$
where $T_{sw\_phase}$ is the running time of the sort-write phase, $D_{sw\_input}$ is the input data volume of a single reduce task, $N_{spill}$ is the number of memory-spill disk writes in the shuffle phase, and $\alpha_{sw\_phase}$ and $\beta_{sw\_phase}$ are model parameters;
S4. Search the model for the parameter combination that minimizes running time: use a search optimization algorithm to find the parameter combination with the shortest phase execution time in the model, searching each phase separately to obtain its own optimal parameter combination.
The beneficial effects of the invention are as follows:
(1) propose that a kind of fine granularity can accurately portray the microoperation model of parametric variations.The model can be intuitive accurate The influence that true describing system Parameters variation comes to executing time-bands.The model makes to multi-parameter simultaneously from the angle of data flow The analysis of job execution time change becomes convenient and accurate when variation.
(2) a kind of strategy carrying out microoperation reconstruct according to operation logic is proposed.This method is not with homework type sum aggregate Group configuration change and change, while search optimized parameter take it is short, it is efficient, portability it is good.This method can be considered a kind of excellent The description method and analytical framework of change problem portray Parameters variation principle from more fine-grained angle and establish the optimal ginseng of model searching Array is closed.
Brief Description of the Drawings
Fig. 1 is a logic diagram of a MapReduce job in the invention.
Detailed Description
The technical solution of the invention is described in detail below with reference to the drawing:
Step 1: For the fine-grained operations that parameters affect directly, build different models according to the type of operation. The core steps are as follows:
1) Decouple the MapReduce task as shown in Fig. 1 to identify micro-operations of different types in different phases: the collection-phase single memory write cm_micro_op and single disk write cd_micro_op; and the shuffle-phase single memory write sm_micro_op, single spill disk write sd_micro_op, and single merge disk write merge_micro_op.
2) For the selected micro-operations, determine the parameters that affect them and their value spaces. The parameters affecting cm_micro_op and cd_micro_op are io.sort.mb and sort.spill.percent, and the value space is each parameter's value range. The parameters affecting sm_micro_op, sd_micro_op and merge_micro_op are reduce.java.opts, shuffle.input.buffer.percent, shuffle.merge.percent and io.sort.factor, and the value space is each parameter's range of variation.
3) Different parameter values result in different amounts of data processed by a single micro-operation. Take discrete values along each dimension of the parameter space, run actual jobs as benchmark tests for the micro-operation models, and measure the rate of a single micro-operation when processing different amounts of data.
4) Collect the execution logs of each phase of the benchmark test cases under different parameter settings, and for the single disk write and single memory write operations of each phase build a model relating single-execution time to the amount of data processed per execution.
$T_{micro\_op} = \alpha \cdot D_{micro\_op} + \beta$
The formula above is the linear model relating the single-execution time of a micro-operation to the amount of data processed per execution. $T_{micro\_op}$ is the execution time of the micro-operation, $D_{micro\_op}$ is the amount of data it processes in one execution, and $\alpha$ and $\beta$ are model parameters.
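As an illustration of how such a linear model could be fitted from benchmark logs, the following is a minimal Python sketch using ordinary least squares. The function names and the sample measurements are hypothetical and are not taken from the patent, which does not state which fitting procedure is used.

```python
from statistics import mean


def fit_micro_op(samples):
    """Least-squares fit of T = alpha * D + beta for ONE micro-operation type.

    samples: list of (data_volume, elapsed_time) pairs, for example parsed
    from benchmark execution logs (the log parsing is not shown here).
    """
    ds = [d for d, _ in samples]
    ts = [t for _, t in samples]
    d_bar, t_bar = mean(ds), mean(ts)
    sxx = sum((d - d_bar) ** 2 for d in ds)
    if sxx == 0:
        raise ValueError("need at least two distinct data volumes")
    sxy = sum((d - d_bar) * (t - t_bar) for d, t in samples)
    alpha = sxy / sxx
    beta = t_bar - alpha * d_bar
    return alpha, beta


def predict_micro_op(alpha, beta, data_volume):
    """Predicted time of a single micro-operation on data_volume units."""
    return alpha * data_volume + beta


# Made-up measurements (MB, seconds) that lie on T = 0.02 * D + 0.5.
samples = [(50, 1.5), (100, 2.5), (200, 4.5), (400, 8.5)]
alpha, beta = fit_micro_op(samples)
```

One such (alpha, beta) pair would be fitted per micro-operation type (cm_micro_op, cd_micro_op, sm_micro_op, sd_micro_op, merge_micro_op), since memory writes and disk writes are modeled separately.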
Step 2: Reconstruct and combine the micro-operation models according to the execution flow of the collection phase to obtain the relationship between the phase's running time and the system parameters. The core steps are as follows:
1) Several parameters jointly determine the amount of data processed by a single micro-operation. Using the relationship between execution time and data volume established in Step 1 for the collection-phase micro-operations, establish the relationship between micro-operation time and the system parameters that affect the operation.
2) Reconstruct the collection phase shown in Fig. 1 from its micro-operations to obtain the actual execution flow of the collection phase, and characterize the relationship between the collection-phase parameters and the collection-phase execution time.
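The reconstruction idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patent's actual reconstruction: it assumes the map-side buffer spills each time io.sort.mb × sort.spill.percent MB has been collected, that each spill consists of one cm_micro_op followed by one cd_micro_op, and that memory writes and disk writes do not overlap. The function name and all coefficient values are hypothetical.

```python
def collection_phase_time(map_output_mb, io_sort_mb, spill_percent,
                          cm_model, cd_model):
    """Sum micro-operation times over the spills of one map task.

    cm_model and cd_model are (alpha, beta) pairs of the linear models for
    cm_micro_op and cd_micro_op. Assumption: a spill is triggered every
    io_sort_mb * spill_percent MB, with no overlap between the two writes.
    """
    chunk_mb = io_sort_mb * spill_percent
    if chunk_mb <= 0:
        raise ValueError("io_sort_mb and spill_percent must be positive")
    (cm_a, cm_b), (cd_a, cd_b) = cm_model, cd_model
    remaining, total_s, spills = float(map_output_mb), 0.0, 0
    while remaining > 0:
        d = min(chunk_mb, remaining)   # data handled by this spill
        total_s += cm_a * d + cm_b     # one cm_micro_op (memory write)
        total_s += cd_a * d + cd_b     # one cd_micro_op (disk write)
        remaining -= d
        spills += 1
    return total_s, spills


# Hypothetical fitted coefficients (seconds per MB, fixed seconds per call).
CM = (0.01, 0.0)
CD = (0.02, 0.1)
small_buffer = collection_phase_time(1000, 100, 0.5, CM, CD)
large_buffer = collection_phase_time(1000, 200, 0.5, CM, CD)
```

With these made-up coefficients the larger buffer halves the number of spills and therefore halves the fixed per-spill overhead, which is the kind of parameter-to-time relationship the reconstruction is meant to expose.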
Step 3: Reconstruct and combine the micro-operation models according to the execution flow of the shuffle phase to obtain the relationship between the phase's running time and the system parameters. The core steps are as follows:
1) Several parameters jointly determine the amount of data processed by a single micro-operation. Using the relationship between execution time and data volume established in Step 1 for the shuffle-phase micro-operations, establish the relationship between micro-operation time and the system parameters that affect the operation.
2) Reconstruct the shuffle phase shown in Fig. 1 from its micro-operations to obtain the actual execution flow of the shuffle phase, and from it the relationship between the shuffle-phase parameters and the shuffle-phase execution time.
3) Model the execution time of the sort_write phase in the reduce task separately: take discrete values in the parameter space defined by the spill count and the data volume, run actual jobs, measure how the phase's execution time under different parameters relates to the number of spills in the shuffle phase and to the amount of data the phase processes, and build a model relating sort_write phase execution time to the spill count and the total amount of data processed by the phase:
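$T_{sw\_phase} = D_{sw\_input} \cdot (N_{spill} \cdot \alpha_{sw\_phase} + \beta_{sw\_phase})$
where $T_{sw\_phase}$ is the running time of the sort_write phase, $D_{sw\_input}$ is the input data volume of a single reduce task, $N_{spill}$ is the number of spills in the shuffle phase, and $\alpha_{sw\_phase}$ and $\beta_{sw\_phase}$ are model parameters.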
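Assuming the sort_write model takes the form T_sw_phase = D_sw_input × (N_spill × α_sw_phase + β_sw_phase), dividing both sides by D_sw_input gives a relation that is linear in N_spill, so the two parameters can be estimated with a simple least-squares fit. The Python sketch below illustrates this; the function names and the run data are hypothetical, and the fitting procedure is an assumption rather than something the patent specifies.

```python
def fit_sort_write(runs):
    """Fit alpha and beta of T = D * (N * alpha + beta).

    runs: list of (input_mb, n_spill, elapsed_s) from benchmark jobs.
    Dividing by D gives T / D = alpha * N + beta, a straight line in N.
    """
    xs = [n for _, n, _ in runs]
    ys = [t / d for d, _, t in runs]
    count = len(runs)
    x_bar = sum(xs) / count
    y_bar = sum(ys) / count
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need runs with at least two distinct spill counts")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    beta = y_bar - alpha * x_bar
    return alpha, beta


def predict_sort_write(alpha, beta, input_mb, n_spill):
    """Predicted sort_write phase time for one reduce task."""
    return input_mb * (n_spill * alpha + beta)


# Made-up runs generated from alpha = 0.002, beta = 0.01.
runs = [(1000, 2, 14.0), (1000, 5, 20.0), (2000, 10, 60.0), (500, 4, 9.0)]
sw_alpha, sw_beta = fit_sort_write(runs)
```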
Step 4: Search the model for the parameter combination that minimizes running time. The core steps are as follows:
1) Steps 1 to 3 yield a way of describing and analyzing the optimization problem that characterizes, at a finer granularity, the relationship between the varying parameters and the analysis target. The model can be used with different algorithms.
2) On the basis of this model, any of a variety of search optimization algorithms can be used to find the parameter combination with the shortest phase execution time.
3) Search each phase separately to obtain its own optimal parameter combination. This completes parameter tuning and yields the optimal combination of all relevant parameters.
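The patent leaves the choice of search algorithm open. As one possible instance, the Python sketch below runs an exhaustive grid search over discrete parameter values for a single phase. The parameter names follow the short forms used in this document; the grid values and the toy cost model are hypothetical stand-ins for the phase models built in Steps 1 to 3.

```python
import math
from itertools import product


def grid_search(param_grid, predict_time):
    """Return the configuration with the smallest predicted phase time.

    param_grid: dict mapping parameter name to a list of candidate values.
    predict_time: callable taking a configuration dict, returning seconds.
    """
    names = list(param_grid)
    best_cfg, best_t = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        cfg = dict(zip(names, values))
        t = predict_time(cfg)
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t


def toy_collection_time(cfg, map_output_mb=1000.0):
    """Toy phase model (assumption): per-MB cost plus a fixed cost per spill."""
    chunk_mb = cfg["io.sort.mb"] * cfg["sort.spill.percent"]
    n_spills = math.ceil(map_output_mb / chunk_mb)
    return 0.03 * map_output_mb + 0.1 * n_spills


# Hypothetical discrete value space for the collection-phase parameters.
collection_grid = {
    "io.sort.mb": [100, 200, 400],
    "sort.spill.percent": [0.5, 0.8],
}
best_cfg, best_t = grid_search(collection_grid, toy_collection_time)
```

Exhaustive search is only practical because each phase has few parameters and the model is cheap to evaluate; for larger spaces a heuristic search could be substituted without changing the model interface.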

Claims (1)

1. An automatic Hadoop parameter tuning method based on micro-operations, used to optimize the parameter combination for executing MapReduce jobs, characterized by comprising the following steps:
S1. Build the micro-operation models:
S11. Select micro-operations: decouple the MapReduce task and select as micro-operations the single memory write cm_micro_op and the single disk write cd_micro_op of the collection phase in the Map task, and the single memory write sm_micro_op, the single memory-spill disk write sd_micro_op, and the single file-merge disk write merge_micro_op of the shuffle phase in the Reduce task;
S12. For the micro-operations selected in step S11, determine the parameters that affect them and the value space of each parameter;
S13. Different parameter values result in different amounts of data processed by a single micro-operation; take discrete values along each dimension of the parameter space, run actual jobs as benchmark tests for the micro-operation models, and measure the rate of a single micro-operation when processing different amounts of data;
S14. Collect the execution logs of each phase of the benchmark test cases under different parameter settings, and for the single disk write and single memory write operations of each phase build a model relating single-execution time to the amount of data processed per execution:
$T_{micro\_op} = \alpha \cdot D_{micro\_op} + \beta$
where $T_{micro\_op}$ is the execution time of the micro-operation, $D_{micro\_op}$ is the amount of data it processes in one execution, and $\alpha$ and $\beta$ are model parameters;
S2. Reconstruct and combine the micro-operation models according to the execution flow of the collection phase to obtain the relationship between the phase's running time and the system parameters:
S21. Using the model from step S14, establish the relationship between micro-operation time and the system parameters that affect the operation;
S22. Reconstruct the collection phase from its micro-operations to obtain the actual execution flow of the collection phase, and from it the relationship between the collection-phase parameters and the phase's execution time;
S3. Reconstruct and combine the micro-operation models according to the execution flow of the shuffle phase to obtain the relationship between the phase's running time and the system parameters:
S31. Using the model from step S14, establish the relationship between micro-operation time and the system parameters that affect the operation;
S32. Reconstruct the shuffle phase from its micro-operations to obtain the actual execution flow of the shuffle phase, and from it the relationship between the shuffle-phase parameters and the phase's execution time;
S33. Model the execution time of the sort-write phase in the Reduce task separately: take discrete values in the parameter space defined by the number of memory-spill disk writes and the data volume, run actual jobs, measure how the phase's execution time under different parameters relates to the number of memory-spill disk writes in the shuffle phase and to the amount of data the phase processes, and build a model relating sort-write phase execution time to the number of memory-spill disk writes and the total amount of data processed by the phase:
$T_{sw\_phase} = D_{sw\_input} \cdot (N_{spill} \cdot \alpha_{sw\_phase} + \beta_{sw\_phase})$
where $T_{sw\_phase}$ is the running time of the sort-write phase, $D_{sw\_input}$ is the input data volume of a single reduce task, $N_{spill}$ is the number of memory-spill disk writes in the shuffle phase, and $\alpha_{sw\_phase}$ and $\beta_{sw\_phase}$ are model parameters;
S4. Search the model for the parameter combination that minimizes running time: use a search optimization algorithm to find the parameter combination with the shortest phase execution time in the model, searching each phase separately to obtain its own optimal parameter combination.
CN201810426699.6A 2018-05-07 2018-05-07 Hadoop parameter automatic tuning method based on micro-operation Expired - Fee Related CN108647135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810426699.6A CN108647135B (en) 2018-05-07 2018-05-07 Hadoop parameter automatic tuning method based on micro-operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810426699.6A CN108647135B (en) 2018-05-07 2018-05-07 Hadoop parameter automatic tuning method based on micro-operation

Publications (2)

Publication Number Publication Date
CN108647135A true CN108647135A (en) 2018-10-12
CN108647135B CN108647135B (en) 2021-02-12

Family

ID=63749200

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810426699.6A Expired - Fee Related CN108647135B (en) 2018-05-07 2018-05-07 Hadoop parameter automatic tuning method based on micro-operation

Country Status (1)

Country Link
CN (1) CN108647135B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9665404B2 (en) * 2013-11-26 2017-05-30 International Business Machines Corporation Optimization of map-reduce shuffle performance through shuffler I/O pipeline actions and planning
CN104361183A (en) * 2014-11-21 2015-02-18 中国人民解放军国防科学技术大学 Microprocessor micro system structure parameter optimizing method based on simulator
CN106383746A (en) * 2016-08-30 2017-02-08 北京航空航天大学 Configuration parameter determination method and apparatus of big data processing system
CN107612886A (en) * 2017-08-15 2018-01-19 中国科学院大学 A kind of Spark platforms Shuffle process compresses algorithm decision-making techniques

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHIVNATH BABU: "《Towards automatic optimization of MapReduce programs》", 《SOCC "10: PROCEEDINGS OF THE 1ST ACM SYMPOSIUM ON CLOUD COMPUTING》 *
童颖: "《基于机器学习的Hadoop参数调优方法》", 《中国优秀硕士学位论文全文库》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427619A (en) * 2019-07-23 2019-11-08 西南交通大学 It is a kind of based on Multichannel fusion and the automatic proofreading for Chinese texts method that reorders
CN110427619B (en) * 2019-07-23 2022-06-21 西南交通大学 Chinese text automatic proofreading method based on multi-channel fusion and reordering
CN111858003A (en) * 2020-07-16 2020-10-30 山东大学 Hadoop optimal parameter evaluation method and device
CN111858003B (en) * 2020-07-16 2021-05-28 山东大学 Hadoop optimal parameter evaluation method and device

Also Published As

Publication number Publication date
CN108647135B (en) 2021-02-12

Similar Documents

Publication Publication Date Title
Lama et al. Aroma: Automated resource allocation and configuration of mapreduce environment in the cloud
Liao et al. Gunther: Search-based auto-tuning of mapreduce
Herodotou et al. No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics
Song et al. A hadoop mapreduce performance prediction method
Garraghan et al. An analysis of the server characteristics and resource utilization in google cloud
Alnafessah et al. Quality-aware devops research: Where do we stand?
CN104750780B (en) A kind of Hadoop configuration parameter optimization methods based on statistical analysis
Herodotou et al. Mapreduce programming and cost-based optimization? crossing this chasm with starfish
CN103019855B (en) Method for forecasting executive time of Map Reduce operation
CN105653647B (en) The information collecting method and system of SQL statement
Fabra et al. Reducing the price of resource provisioning using EC2 spot instances with prediction models
CN113157421B (en) Distributed cluster resource scheduling method based on user operation flow
CN108647135A (en) A kind of Hadoop parameter automated tuning methods based on microoperation
CN105630575A (en) Performance evaluation method aiming at KVM virtualization server
Zhu et al. Kea: Tuning an exabyte-scale data infrastructure
Pettijohn et al. {User-Centric}{Heterogeneity-Aware}{MapReduce} Job Provisioning in the Public Cloud
Herodotou et al. Automatic performance tuning for distributed data stream processing systems
Guo et al. Automated exploration and implementation of distributed cnn inference at the edge
Bağbaba et al. Improving the I/O performance of applications with predictive modeling based auto-tuning
Domaschka et al. Hathi: an MCDM-based approach to capacity planning for cloud-hosted DBMS
Qi et al. Data mining based root-cause analysis of performance bottleneck for big data workload
Zhang et al. Getting more for less in optimized mapreduce workflows
Vallabhajosyula et al. Establishing a Generalizable Framework for Generating Cost-Aware Training Data and Building Unique Context-Aware Walltime Prediction Regression Models
Sangroya et al. Performance assurance model for hiveql on large data volume
Anselmo et al. A data-centric approach for reducing carbon emissions in deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210212

CF01 Termination of patent right due to non-payment of annual fee