CN108647135B - Hadoop parameter automatic tuning method based on micro-operation - Google Patents

Hadoop parameter automatic tuning method based on micro-operation

Info

Publication number
CN108647135B
CN108647135B CN201810426699.6A
Authority
CN
China
Prior art keywords
micro
stage
model
phase
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810426699.6A
Other languages
Chinese (zh)
Other versions
CN108647135A (en)
Inventor
滕飞
李耘书
李天瑞
杜圣东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN201810426699.6A priority Critical patent/CN108647135B/en
Publication of CN108647135A publication Critical patent/CN108647135A/en
Application granted granted Critical
Publication of CN108647135B publication Critical patent/CN108647135B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3447Performance evaluation by modeling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466Performance evaluation by tracing or monitoring
    • G06F11/3476Data logging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Debugging And Monitoring (AREA)
  • Organic Low-Molecular-Weight Compounds And Preparation Thereof (AREA)

Abstract

The invention belongs to the technical field of cloud computing, and particularly relates to a Hadoop parameter automatic tuning method based on micro-operations. The method decouples a MapReduce task into different types of micro-operations at different stages, establishes for each selected micro-operation a model relating single execution time to single processed data volume, reconstructs and combines the job process according to the established models to obtain the relation between stage running time and system parameters, and finally searches the models for the parameter combination that minimizes task running time. The method does not change with job type or cluster configuration, and its search for the optimal parameters is fast, efficient, and portable.

Description

Hadoop parameter automatic tuning method based on micro-operation
Technical Field
The invention belongs to the technical field of cloud computing, and particularly relates to a Hadoop parameter automatic tuning method based on micro-operation reconstruction.
Background
Resource optimization of distributed platforms has long been one of the hot topics for users; in particular, optimizing job running time has always been an important research subject. With the popularization of cloud services in recent years, shortening job running time helps tenants improve working efficiency and reduce rental costs, while helping providers maximize resource utilization.
In recent years, Hadoop distributed computing platforms have matured and been widely applied in industry, while in academia, optimization of various aspects of the Hadoop platform remains a key research subject. As Hadoop versions are continuously updated, computing efficiency is no longer the main concern; instead, huge production clusters gradually incur expensive operation and maintenance costs, and unreasonable allocation of cloud resources increasingly highlights companies' cost problems. The cost optimization of cloud distributed computing frameworks during computation is therefore one of the problems large IT companies currently need to solve.
There have been some research efforts directed at optimizing Hadoop job run time:
1) Method and system for automatically tuning Hadoop parameters based on machine learning, Shi, Feng Dan, Rui Li. CN106202431A. 2016.
By clustering the resource-consumption characteristics of different job types and establishing different performance models, this method automatically identifies the parameters that strongly influence each type of application and gives quantitative suggested parameter values. It effectively addresses the heavy dependence of existing empirical-rule-based methods on user experience and the limitations of qualitative parameter suggestions.
2) An iterative MapReduce job parameter auto-tuning method, Zhao Taosen, Gao Xiaojie, Tang Hua. CN106326005A. 2017.
This method executes the actual job and evaluates the execution effect, determines a new parameter configuration combination in the parameter space, and then executes the job iteratively until a termination criterion is met.
Judging from the patents of the last two years, the main emphasis has been on characterizing how parameter changes influence running time. Another important concern is the portability of automatic Hadoop parameter tuning across platforms. A method that can quickly establish a tuning model for different job types on different clusters would therefore have important practical significance.
Disclosure of Invention
The invention aims to provide a Hadoop 2.0 parameter automatic tuning method based on micro-operation reconstruction, proposed in view of the important practical significance of automatic Hadoop parameter tuning amid the current rise of cloud computing services.
The technical scheme adopted by the invention is as follows:
a Hadoop parameter automatic tuning method based on micro-operation is used for optimizing parameter combination during MapReduce operation execution, and is characterized by comprising the following steps:
s1, establishing a micro-operation model:
s11, selecting a micro operation: decoupling a MapReduce task, selecting a single memory write operation cm _ micro _ op and a single disk write operation cd _ micro _ op in a collection stage in the Map task, and taking a shuffle stage single memory write operation sm _ micro _ op, a single memory overflow disk write operation sd _ micro _ op and a single file merge disk write operation merge _ micro _ op in a Reduce task as micro-operations;
s12, determining the parameter change space influencing the micro-operation according to the micro-operation selected in the step S11;
s13, determining the difference of data volume processed by single micro-operation according to different parameter values, discretely taking values in each dimension in a parameter space, executing actual operation as a micro-operation model benchmark test, and testing the speed of the single micro-operation under the condition of processing different data volumes;
s14, collecting execution logs of the benchmark test case under different parameter conditions at different stages, and respectively establishing models of single execution time and single processing data volume for single disk write operation and single memory write operation at different stages:
T_micro_op = α * D_micro_op + β
T_micro_op denotes the micro-operation execution time, D_micro_op denotes the data volume processed by a single micro-operation, and α and β are model parameters;
s2, reconstructing and combining the micro-operation model according to the operation process of the collection phase to obtain the relation between the phase operation time and the system parameters:
s21, establishing the relationship between the micro-operation time and the system parameters influencing the operation according to the model of the step S14;
s22, reconstructing the collection stage based on the micro-operation to obtain the running process of the actual collection stage, and obtaining the relation between the relevant parameters of the collection stage and the running time of the stage;
s3, reconstructing and combining the micro-operation model according to the operation process of the shuffling stage to obtain the relationship between the stage operation time and the system parameters:
s31, establishing the relationship between the micro-operation time and the system parameters influencing the operation according to the model of the step S14;
s32, reconstructing the shuffling stage based on the micro-operation to obtain the running process of the actual shuffling stage, and obtaining the relation between the relevant parameters of the shuffling stage and the running time of the stage;
s33, independently modeling the execution time of the sequencing write stage in the Reduce task, discretely taking values in a parameter space determined by the write times of the memory overflow disk and the data volume, executing the actual job task, testing the execution time of the stage under different parameters and the relationship between the write times of the memory overflow disk and the data volume processed by the stage in the shuffle stage, and establishing a model of the execution time of the sequencing write stage, the write times of the memory overflow disk and the total data volume processed by the stage:
T_sw_phase = D_sw_input * (N_spill * α_sw_phase + β_sw_phase)
T_sw_phase denotes the sequencing write (sort_write) stage running time, D_sw_input denotes the input data volume of a single reduce task, N_spill denotes the number of memory-overflow disk writes in the shuffle stage, and α_sw_phase and β_sw_phase are model parameters;
s4, finding the parameter combination which enables the task running time to be shortest in the model: and obtaining the parameter combination with the shortest execution time in the stages in the model by adopting a search optimization algorithm, and searching in different stages to obtain the respective optimal parameter combination.
The invention has the beneficial effects that:
(1) A fine-grained micro-operation model capable of accurately depicting the influence of parameter changes is provided. The model can intuitively and accurately depict the influence of system parameter changes on execution time, and facilitates accurate analysis, from the viewpoint of data flow, of how job execution time changes when multiple parameters change simultaneously.
(2) A strategy for performing micro-operation reconstruction according to the operating principle is presented. The method does not change with job type or cluster configuration, and the search for the optimal parameters is fast, efficient, and portable. It can serve as a description method and analysis framework for the optimization problem, depicting the principle of parameter variation and establishing models at a finer granularity to search for the optimal parameter combination.
Drawings
FIG. 1 is a logic block diagram of a MapReduce job in the present invention.
Detailed Description
The technical scheme of the invention is described in detail below with reference to the accompanying drawings:
the method comprises the following steps: aiming at fine-grained operation directly influenced by parameters, different models are established according to different operation types, and the core steps are as follows:
1) Decoupling the MapReduce task in the manner shown in FIG. 1 to determine the different types of micro-operations at different stages: the collection phase single memory write operation cm_micro_op and the collection phase single disk write operation cd_micro_op; the shuffle phase single memory write operation sm_micro_op, the shuffle phase single spill disk write operation sd_micro_op, and the shuffle phase single merge disk write operation merge_micro_op.
2) Determining the parameter change space influencing the selected micro-operations. The parameters affecting cm_micro_op and cd_micro_op are io.sort.mb and sort.spill.percentage, and the value space is their respective value range; the parameters affecting sm_micro_op, sd_micro_op and merge_micro_op are reduce.java.ops, shuffle.input.buffer.percentage, shuffle.merge.percentage and io.sort.factor, and the value space is their respective variation range.
3) Determining how the data volume processed by a single micro-operation differs under different parameter values: taking discrete values along each dimension of the parameter space, executing actual jobs as the benchmark test of the micro-operation model, and testing the speed of a single micro-operation when processing different data volumes.
4) Collecting the execution logs of the benchmark test cases at different stages under different parameter conditions, and establishing, for the single disk write operation and the single memory write operation of each stage, a model of single execution time versus single processed data volume.
T_micro_op = α * D_micro_op + β
The above equation is the established linear model relating a micro-operation's single execution time to its single processed data volume. T_micro_op denotes the micro-operation execution time, D_micro_op denotes the data volume processed in a single pass of the micro-operation, and α and β are model parameters.
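As a concrete illustration, the α and β of each micro-operation can be fitted from the benchmark logs by ordinary least squares. This is a minimal sketch; the sample values are hypothetical, not measurements from the invention's benchmark:

```python
# Least-squares fit of the micro-operation model T = alpha * D + beta.
# `samples` is a hypothetical benchmark log of (bytes processed, seconds)
# pairs for one micro-operation type; the numbers are illustrative only.

def fit_micro_op_model(samples):
    """Fit T = alpha * D + beta by ordinary least squares."""
    n = len(samples)
    sum_d = sum(d for d, _ in samples)
    sum_t = sum(t for _, t in samples)
    sum_dd = sum(d * d for d, _ in samples)
    sum_dt = sum(d * t for d, t in samples)
    alpha = (n * sum_dt - sum_d * sum_t) / (n * sum_dd - sum_d * sum_d)
    beta = (sum_t - alpha * sum_d) / n
    return alpha, beta

samples = [(16e6, 0.20), (32e6, 0.38), (64e6, 0.74), (128e6, 1.46)]
alpha, beta = fit_micro_op_model(samples)  # slope and intercept of the model
```

A separate (α, β) pair would be fitted for each micro-operation type and each stage.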
Step two: reconstructing and combining the micro-operation model according to the operation process of the collection phase to obtain the relationship between the phase operation time and the system parameters, wherein the core steps are as follows:
1) The data volume processed by a single micro-operation is jointly determined by multiple parameters. Through the relationship between collection phase micro-operation execution time and data volume established in step one, the relationship between micro-operation time and the system parameters influencing it is established.
2) Reconstructing the collection phase shown in FIG. 1 based on the micro-operations to obtain the actual running process of the collection phase, depicting the relationship between the collection phase parameters and the collection phase execution time.
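One way such a reconstruction could compose the collection phase from the two micro-operations is sketched below, under two stated assumptions: every map output byte passes once through the memory buffer (cm_micro_op), and a disk spill (cd_micro_op) occurs each time the buffer fills to io.sort.mb * sort.spill.percentage. The spill rule and the cost constants are illustrative assumptions, not the patent's fitted model:

```python
import math

def collection_phase_time(map_output_bytes, io_sort_mb, spill_percentage,
                          t_cm_per_byte, t_cd_per_spill):
    """Predict collection phase time as buffer-write cost plus spill cost.

    Assumes one memory write per byte and one disk spill each time the
    buffer reaches io.sort.mb * spill_percentage bytes (illustrative rule).
    """
    spill_threshold = io_sort_mb * 1024 * 1024 * spill_percentage  # bytes
    n_spills = math.ceil(map_output_bytes / spill_threshold)
    return map_output_bytes * t_cm_per_byte + n_spills * t_cd_per_spill

# 200 MB of map output, io.sort.mb = 100, spill at 80% full; the per-byte
# and per-spill costs stand in for fitted micro-operation model values.
t_collect = collection_phase_time(200 * 1024 * 1024, 100, 0.8, 1e-9, 0.5)
```

Because the parameters enter only through the spill threshold and spill count, changing io.sort.mb or sort.spill.percentage changes the predicted stage time directly.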
Step three: reconstructing and combining the micro-operation model according to the running process of the shuffle phase to obtain the relation between the phase running time and the system parameters, wherein the core steps are as follows:
1) Multiple parameters jointly determine the data volume processed by a single micro-operation. Through the relationship between shuffle phase micro-operation execution time and data volume established in step one, the relationship between micro-operation time and the system parameters influencing it is established.
2) Reconstructing the shuffle phase shown in FIG. 1 based on the micro-operations to obtain the actual running process of the shuffle phase, and obtaining the relationship between the shuffle phase parameters and the shuffle phase execution time.
3) Independently modeling the execution time of the sort_write phase in the reduce task: taking discrete values in the parameter space determined by the number of spills and the data volume, executing actual job tasks, testing the stage's execution time under different parameters and its relationship with the number of shuffle phase spills and the data volume processed by the stage, and establishing a model of the sort_write phase execution time versus the number of spills and the total data processed by the stage:
T_sw_phase = D_sw_input * (N_spill * α_sw_phase + β_sw_phase)
T_sw_phase denotes the sort_write phase running time, D_sw_input denotes the input data volume of a single reduce task, N_spill denotes the number of shuffle phase spills, and α_sw_phase and β_sw_phase are model parameters.
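Once α_sw_phase and β_sw_phase have been fitted, the sort_write model is evaluated directly; the coefficient values below are illustrative placeholders rather than fitted results:

```python
def sort_write_time(d_sw_input, n_spill, alpha_sw_phase, beta_sw_phase):
    """T_sw_phase = D_sw_input * (N_spill * alpha_sw_phase + beta_sw_phase)."""
    return d_sw_input * (n_spill * alpha_sw_phase + beta_sw_phase)

# 100 MB of reduce input and 4 shuffle-phase spills, with placeholder
# coefficients (seconds per byte per spill, seconds per byte).
t_sw = sort_write_time(1e8, 4, 2e-9, 5e-9)
```

Note that the per-byte cost grows with N_spill, which is how the model captures the extra merge work caused by more spill files.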
Step four: finding the parameter combination in the model that minimizes the task running time. The core steps are as follows:
1) Steps one to three yield a description method and an analysis framework for the optimization problem, depicting the relationship between the varied parameters and the analysis target at a finer granularity; the framework can accommodate different algorithms.
2) On the basis of the model, various search optimization algorithms can be applied to obtain the parameter combination with the shortest execution time of the stages in the model.
3) Searching each stage to obtain its respective optimal parameter combination, thereby completing the parameter tuning and obtaining the optimal combination of all relevant parameters.
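One simple instance of the search in steps 2) and 3) is an exhaustive grid search over the discretized parameter space; the patent leaves the concrete search optimization algorithm open, and the parameter names and toy cost model below are purely illustrative:

```python
import itertools

def grid_search(stage_time_model, param_grid):
    """Return the (combination, time) minimizing the modeled stage time."""
    names = list(param_grid)
    best_combo, best_time = None, float("inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        combo = dict(zip(names, values))
        t = stage_time_model(combo)
        if t < best_time:
            best_combo, best_time = combo, t
    return best_combo, best_time

# Toy stage-time model with a fictitious optimum at io.sort.mb = 200.
model = lambda p: (p["io.sort.mb"] - 200) ** 2 + p["sort.spill.percentage"]
grid = {"io.sort.mb": [100, 200, 300], "sort.spill.percentage": [0.6, 0.8]}
best, best_t = grid_search(model, grid)
```

Because the model is cheap to evaluate (no actual job runs), any other search optimization algorithm, such as hill climbing or random search, could be substituted for the grid search.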

Claims (1)

1. A Hadoop parameter automatic tuning method based on micro-operation is used for optimizing parameter combination during MapReduce operation execution, and is characterized by comprising the following steps:
s1, establishing a micro-operation model:
s11, selecting a micro operation: decoupling a MapReduce task, selecting a single memory write operation cm _ micro _ op and a single disk write operation cd _ micro _ op in a collection stage in the Map task, and taking a shuffle stage single memory write operation sm _ micro _ op, a single memory overflow disk write operation sd _ micro _ op and a single file merge disk write operation merge _ micro _ op in a Reduce task as micro-operations;
s12, according to the micro-operation selected in the step S11, determining a parameter change space which influences the micro-operation, specifically: parameters affecting the cm _ micro _ op and the cd _ micro _ op are io.sort.mb and sort.spill.percentage, and the value space is the respective value range; parameters affecting sm _ micro _ op, sd _ micro _ op and merge _ micro _ op are reduce.java.ops, shuffle.input.buffer.percentage, shuffle.merge.percentage and io.sort.factor, and the value space is the respective variation range;
s13, determining the difference of data volume processed by single micro-operation according to different parameter values, discretely taking values in each dimension in a parameter space, executing actual operation as a micro-operation model benchmark test, and testing the speed of the single micro-operation under the condition of processing different data volumes;
s14, collecting execution logs of the benchmark test case under different parameter conditions at different stages, and respectively establishing models of single execution time and single processing data volume for single disk write operation and single memory write operation at different stages:
T_micro_op = α * D_micro_op + β
T_micro_op denotes the micro-operation execution time, D_micro_op denotes the data volume processed by a single micro-operation, and α and β are model parameters;
s2, reconstructing and combining the micro-operation model according to the operation process of the collection phase to obtain the relation between the phase operation time and the system parameters:
s21, establishing the relationship between the micro-operation time and the system parameters influencing the operation according to the model of the step S14;
s22, reconstructing the collection stage based on the micro-operation to obtain the running process of the actual collection stage, and obtaining the relation between the relevant parameters of the collection stage and the running time of the stage;
s3, reconstructing and combining the micro-operation model according to the operation process of the shuffling stage to obtain the relationship between the stage operation time and the system parameters:
s31, establishing the relationship between the micro-operation time and the system parameters influencing the operation according to the model of the step S14;
s32, reconstructing the shuffling stage based on the micro-operation to obtain the running process of the actual shuffling stage, and obtaining the relation between the relevant parameters of the shuffling stage and the running time of the stage;
s33, independently modeling the execution time of the sequencing write stage in the Reduce task, discretely taking values in a parameter space determined by the write times of the memory overflow disk and the data volume, executing the actual job task, testing the execution time of the stage under different parameters and the relationship between the write times of the memory overflow disk and the data volume processed by the stage in the shuffle stage, and establishing a model of the execution time of the sequencing write stage, the write times of the memory overflow disk and the total data volume processed by the stage:
T_sw_phase = D_sw_input * (N_spill * α_sw_phase + β_sw_phase)
T_sw_phase denotes the sequencing write (sort_write) stage running time, D_sw_input denotes the input data volume of a single reduce task, N_spill denotes the number of memory-overflow disk writes in the shuffle stage, and α_sw_phase and β_sw_phase are model parameters;
s4, finding out the parameter combination which enables the task running time to be shortest in the model: and obtaining the parameter combination with the shortest execution time in the stages in the model by adopting a search optimization algorithm, and searching in different stages to obtain the respective optimal parameter combination.
CN201810426699.6A 2018-05-07 2018-05-07 Hadoop parameter automatic tuning method based on micro-operation Expired - Fee Related CN108647135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810426699.6A CN108647135B (en) 2018-05-07 2018-05-07 Hadoop parameter automatic tuning method based on micro-operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810426699.6A CN108647135B (en) 2018-05-07 2018-05-07 Hadoop parameter automatic tuning method based on micro-operation

Publications (2)

Publication Number Publication Date
CN108647135A CN108647135A (en) 2018-10-12
CN108647135B true CN108647135B (en) 2021-02-12

Family

ID=63749200

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810426699.6A Expired - Fee Related CN108647135B (en) 2018-05-07 2018-05-07 Hadoop parameter automatic tuning method based on micro-operation

Country Status (1)

Country Link
CN (1) CN108647135B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427619B (en) * 2019-07-23 2022-06-21 西南交通大学 Chinese text automatic proofreading method based on multi-channel fusion and reordering
CN111858003B (en) * 2020-07-16 2021-05-28 山东大学 Hadoop optimal parameter evaluation method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361183A * 2014-11-21 2015-02-18 National University of Defense Technology Microprocessor micro-architecture parameter optimization method based on a simulator
CN106383746A * 2016-08-30 2017-02-08 Beihang University Configuration parameter determination method and apparatus of big data processing system
US9665404B2 * 2013-11-26 2017-05-30 International Business Machines Corporation Optimization of map-reduce shuffle performance through shuffler I/O pipeline actions and planning
CN107612886A * 2017-08-15 2018-01-19 University of Chinese Academy of Sciences Spark platform Shuffle process compression algorithm decision method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Towards automatic optimization of MapReduce programs"; Shivnath Babu; SoCC '10: Proceedings of the 1st ACM Symposium on Cloud Computing; 20101231; full text *
"Hadoop Parameter Tuning Method Based on Machine Learning" (in Chinese); Tong Ying; China Masters' Theses Full-Text Database; 20170115; full text *

Also Published As

Publication number Publication date
CN108647135A (en) 2018-10-12

Similar Documents

Publication Publication Date Title
CN107612886B (en) Spark platform Shuffle process compression algorithm decision method
Song et al. A hadoop mapreduce performance prediction method
EP3180695A1 (en) Systems and methods for auto-scaling a big data system
CN102799486A (en) Data sampling and partitioning method for MapReduce system
Mustafa et al. A machine learning approach for predicting execution time of spark jobs
CN110727506B (en) SPARK parameter automatic tuning method based on cost model
CN108647135B (en) Hadoop parameter automatic tuning method based on micro-operation
CN106383746A (en) Configuration parameter determination method and apparatus of big data processing system
CN113157421B (en) Distributed cluster resource scheduling method based on user operation flow
Pettijohn et al. User-Centric Heterogeneity-Aware MapReduce Job Provisioning in the Public Cloud
Li et al. An adaptive auto-configuration tool for hadoop
CN106326005A (en) Automatic parameter tuning method for iterative MapReduce operation
Zhou et al. Model and application of product conflict problem with integrated TRIZ and Extenics for low-carbon design
CN113032367A (en) Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system
CN109635473B (en) Heuristic high-flux material simulation calculation optimization method
CN102546235A (en) Performance diagnosis method and system of web-oriented application under cloud computing environment
CN110377525A (en) A kind of parallel program property-predication system based on feature and machine learning when running
Li et al. Data balancing-based intermediate data partitioning and check point-based cache recovery in Spark environment
Gu et al. Characterizing job-task dependency in cloud workloads using graph learning
JP2015191428A (en) Distributed data processing apparatus, distributed data processing method, and distributed data processing program
Liu et al. A survey of speculative execution strategy in MapReduce
Lu et al. On the auto-tuning of elastic-search based on machine learning
Zhang et al. Getting more for less in optimized mapreduce workflows
CN107621970B (en) Virtual machine migration method and device for heterogeneous CPU
CN111813512A (en) High-energy-efficiency Spark task scheduling method based on dynamic partition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210212