CN103761146A

CN103761146A - Method for dynamically setting quantities of slots for MapReduce

Info

Publication number: CN103761146A
Application number: CN201410004521.4A
Authority: CN
Inventors: 宗栋瑞; 郭美思
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2014-01-06
Filing date: 2014-01-06
Publication date: 2014-04-30
Anticipated expiration: 2034-01-06
Also published as: CN103761146B

Abstract

The invention provides a method for dynamically setting the number of slots for MapReduce. The method first sets the number of slots according to the computing power of each node in the cluster, and then adjusts that number according to the memory available on each node. Compared with the prior art, the method improves the running performance of MapReduce programs and makes better use of resources; it is practical and easy to popularize.

Description

A method for dynamically setting the number of slots in MapReduce

Technical field

The present invention relates to the field of computer technology, and specifically to a method for dynamically setting the number of slots in MapReduce.

Background technology

With the development of Internet technology, data is growing explosively and the scale of data on the network is increasing sharply. This disordered data holds enormous business opportunities, and value can be extracted from it. The problem that follows is that the data-processing capacity of a single machine cannot meet the processing requirements of current massive-data applications, so distributed computing on large-scale clusters has become the main route to future gains in data-processing performance. This work studies MapReduce, the core computing model of Hadoop, and addresses the problem that, by default, every node in MapReduce is configured with the same number of map and reduce slots. It proposes a strategy for dynamically setting the number of slots in MapReduce: different map and reduce counts are set according to the differing hardware configurations of the nodes in the cluster.

At present, the number of maps and reduces in MapReduce is set as follows. The number of map tasks a node can run is given by the parameter mapred.tasktracker.map.tasks.maximum, but how many slots a TaskTracker can actually be configured with still depends on its physical environment. Each task is executed independently in a newly started JVM, so multiple tasks mean multiple JVMs. Each JVM consumes some memory, and once the memory consumed by the DataNode and the TaskTracker is added, the machine's memory may be insufficient. Besides the memory limit allotted to each newly started JVM, one must also consider how many new JVMs need to be started at all, that is, the number of map slots and reduce slots. These settings also depend on the number of processors in the machine. The concrete configuration has to be determined by observing and analyzing how the cluster actually runs. The size of the input split determines how many maps a job has. If the input data volume is huge, the default block size can lead to tens of thousands or even hundreds of thousands of Map Tasks; network traffic in the cluster becomes very heavy and, most seriously, this places great pressure on the JobTracker's scheduling, queues, and memory. The number of slots should therefore be set to match the computing power of the machine.

In Hadoop, slots represent the resources on each TaskTracker; one slot represents a fixed combination of resources. When a MapReduce program runs, the number of map slots and reduce slots on each TaskTracker is configured through mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. Once these two parameters are configured, they cannot be modified dynamically. Because different tasks require different amounts of resources, and the hardware configurations of the nodes in a cluster also differ, a strategy for dynamically setting the number of slots in MapReduce is proposed to account for the differences in node resources. This strategy sets the number of slots dynamically according to each node's computing power and improves the performance of MapReduce program execution.

Summary of the invention

The technical task of the present invention is to remedy the deficiencies of the prior art by providing a method for dynamically setting the number of slots in MapReduce.

The technical solution of the present invention is implemented as follows. The specific setting procedure of the method is:

First, determine the number of CPUs in each cluster node. Then, based on the number of CPU cores in each node, determine the number of slots within the master/slave MapReduce framework: the job queue and the resource status of the TaskTracker nodes serve as input, where the resource status of a TaskTracker comprises the number of CPU cores and the memory size of the node, and the number of slots is then set according to the node's computing power;

The master node of the master/slave MapReduce framework runs the JobTracker, which is responsible for monitoring the cluster and scheduling tasks; the slave nodes run TaskTrackers, which are responsible for monitoring task execution and reporting progress;

Each TaskTracker periodically sends a heartbeat message to the JobTracker, carrying the resource usage of its node;

Scheduling on the master node takes place when a heartbeat arrives: if the TaskTracker reports that it has idle resources, the JobTracker uses a scheduling algorithm to select a task and dispatch it to that node to run.
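The heartbeat-driven scheduling just described can be illustrated with a small Python sketch. It is not Hadoop code: the names `Heartbeat`, `JobTrackerSketch`, and `on_heartbeat` are hypothetical, and the first-in-first-out choice of task, as well as filling every idle slot in a single heartbeat, are assumptions standing in for whatever scheduling algorithm the JobTracker actually uses.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Heartbeat:
    """Resource report a TaskTracker sends to the JobTracker (hypothetical shape)."""
    node: str
    free_map_slots: int
    free_reduce_slots: int


class JobTrackerSketch:
    """Toy JobTracker: hands out pending tasks when a heartbeat reports idle slots."""

    def __init__(self, map_tasks, reduce_tasks):
        # Assumption: a plain FIFO queue per task kind stands in for the scheduler.
        self.pending = {"map": deque(map_tasks), "reduce": deque(reduce_tasks)}

    def on_heartbeat(self, hb):
        """Return (node, task) assignments for the idle slots reported in hb."""
        assignments = []
        idle = (("map", hb.free_map_slots), ("reduce", hb.free_reduce_slots))
        for kind, free in idle:
            queue = self.pending[kind]
            # Assign at most one task per idle slot, and never more than are queued.
            for _ in range(min(free, len(queue))):
                assignments.append((hb.node, queue.popleft()))
        return assignments


tracker = JobTrackerSketch(map_tasks=["m1", "m2", "m3"], reduce_tasks=["r1"])
first = tracker.on_heartbeat(Heartbeat("node-a", free_map_slots=2, free_reduce_slots=1))
second = tracker.on_heartbeat(Heartbeat("node-b", free_map_slots=4, free_reduce_slots=0))
```

In this sketch a node that reports no idle slots simply receives no work, which is how the slot count caps the number of tasks running concurrently on a node.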

Setting the number of slots requires two variables, one for map slots and one for reduce slots. First, the code in TaskTracker is modified so that the number of map slots is initially set to the number of CPU cores in the node and the number of reduce slots is initially set to half the number of CPU cores in the node. Then, in a class method, the amount of memory to request is determined from the number of slots: the total memory allocation for tasks equals the number of map slots multiplied by the memory size of a single map slot in the TaskTracker, plus the number of reduce slots multiplied by the memory size of a single reduce slot in the TaskTracker. If the total memory allocation is smaller than the free memory of the node, the slot counts are set to these values. If the free memory of the node is smaller than the total memory allocation, the number of map slots and the number of reduce slots are reduced alternately until the node's memory condition is met.
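The slot-setting procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the modified TaskTracker code: `SlotPlan` and `plan_slots` are hypothetical names, the per-slot memory sizes are inputs the caller must supply, and the text does not specify the step size or which slot kind shrinks first, so the sketch assumes a step of one slot, starts with map slots, and keeps at least one slot of each kind.

```python
from dataclasses import dataclass


@dataclass
class SlotPlan:
    map_slots: int
    reduce_slots: int

    def memory_mb(self, map_slot_mb, reduce_slot_mb):
        # Total = map slots x memory per map slot + reduce slots x memory per reduce slot.
        return self.map_slots * map_slot_mb + self.reduce_slots * reduce_slot_mb


def plan_slots(cpu_cores, free_mb, map_slot_mb, reduce_slot_mb):
    """Pick map/reduce slot counts from the core count, then shrink to fit memory."""
    # Initial setting: map slots = cores, reduce slots = half the cores.
    plan = SlotPlan(map_slots=cpu_cores, reduce_slots=max(1, cpu_cores // 2))
    shrink_map = True  # assumption: the first reduction is applied to map slots
    while plan.memory_mb(map_slot_mb, reduce_slot_mb) > free_mb:
        if plan.map_slots == 1 and plan.reduce_slots == 1:
            raise ValueError("node cannot host one map slot and one reduce slot")
        # Alternate between the two kinds; skip a kind that is already down to one.
        if (shrink_map and plan.map_slots > 1) or plan.reduce_slots == 1:
            plan.map_slots -= 1
        else:
            plan.reduce_slots -= 1
        shrink_map = not shrink_map
    return plan


# A node with plenty of memory keeps the CPU-based setting.
roomy = plan_slots(cpu_cores=24, free_mb=96 * 1024, map_slot_mb=1024, reduce_slot_mb=2048)
# A memory-constrained node ends up with fewer slots of both kinds.
tight = plan_slots(cpu_cores=8, free_mb=8 * 1024, map_slot_mb=1024, reduce_slot_mb=2048)
```

Raising an error when even one slot of each kind does not fit is a design choice of the sketch; the text does not say what should happen in that case.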

Compared with the prior art, the beneficial effects of the present invention are:

The method analyzes the computing power of the nodes in a Hadoop cluster and uses the CPU and memory status of each node to determine the number of slots, from which reasonable map and reduce counts are obtained. This strategy greatly improves the performance of the whole cluster when processing MapReduce tasks and optimizes resource utilization; it is practical and easy to popularize.

Brief description of the drawings

Figure 1 is a flowchart of running a job according to the present invention.

Figure 2 is a flowchart of setting the number of slots according to the present invention.

Embodiment

The method of the present invention for dynamically setting the number of slots in MapReduce is described in detail below with reference to the accompanying drawings.

The present invention addresses an important and pressing problem for MapReduce in today's big-data Hadoop clusters: setting the number of maps and reduces dynamically according to the differing hardware configurations and computing power of the nodes in the cluster. The strategy proposed here effectively solves the problem of dynamically setting the number of slots and greatly improves the performance of the whole cluster when processing MapReduce tasks.

The present invention relies on the master/slave MapReduce framework, which adopts a Master/Slave architecture and consists mainly of the following four parts:

1) Client.

2) JobTracker: The JobTracker is responsible for resource monitoring and job scheduling. It monitors the health of all TaskTrackers and jobs and, once it detects a failure, transfers the affected tasks to other nodes. The JobTracker also tracks information such as task execution progress and resource usage and passes it to the task scheduler, which selects suitable tasks to use resources as they become idle. In Hadoop, the task scheduler is a pluggable module, and users can design a scheduler to suit their own needs.

3) TaskTracker: The TaskTracker periodically reports the resource usage of its node and the progress of its tasks to the JobTracker via heartbeats, and it receives commands from the JobTracker and performs the corresponding operations (such as starting a new task or killing a task). The TaskTracker divides the resources on its node into equal units called "slots". A slot represents computing resources (CPU, memory, etc.). A Task can run only after it obtains a slot, and the role of the Hadoop scheduler is to assign the idle slots on each TaskTracker to Tasks. Slots are of two kinds, map slots and reduce slots, used by Map Tasks and Reduce Tasks respectively. The TaskTracker limits Task concurrency through the number of slots (a configurable parameter).

4) Task: Tasks are of two kinds, Map Tasks and Reduce Tasks, both started by the TaskTracker. HDFS stores data in fixed-size blocks as its basic unit, whereas the processing unit for MapReduce is the split. A split is a logical concept that contains only metadata, such as the start position of the data, the data length, and the node where the data resides. How splits are divided is decided entirely by the user. Note, however, that the number of splits determines the number of Map Tasks, because each split is processed by exactly one Map Task.
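Because each split is handled by exactly one Map Task, the number of Map Tasks for a job can be estimated from the input sizes. The Python sketch below assumes that every file is cut independently into fixed-size splits and that the split size equals the block size; `count_map_tasks` is a hypothetical helper, and real input formats may divide data differently.

```python
MIB = 1024 * 1024


def count_map_tasks(file_sizes_bytes, split_size_bytes):
    """Estimate the number of Map Tasks: one per split, splits never span files."""
    total = 0
    for size in file_sizes_bytes:
        if size > 0:
            # Integer ceiling division: a partial trailing split still needs a task.
            total += -(-size // split_size_bytes)
    return total


# Three files of 200 MiB, 64 MiB, and 1 byte with 64 MiB splits.
small_job = count_map_tasks([200 * MIB, 64 * MIB, 1], split_size_bytes=64 * MIB)

# A single 1 TiB input with 64 MiB splits already yields tens of thousands of tasks.
large_job = count_map_tasks([1024 * 1024 * MIB], split_size_bytes=64 * MIB)
```

The second call illustrates why very large inputs put pressure on the JobTracker: the task count grows linearly with the input size for a fixed split size.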

As shown in Figures 1 and 2, the method provided by the present invention sets the number of slots mainly according to the computing power of each cluster node, which is determined by two factors: the number of CPUs and the memory. First the number of CPUs in each cluster node is determined, and then the number of slots is determined from the number of CPU cores in each node. Tasks are thus processed according to each node's computing power, so MapReduce tasks run more efficiently and performance improves. For the memory factor, the strategy determines the amount of memory to request from the number of slots and then adjusts the number of slots according to the node's memory status: if memory is insufficient during the request, the number of slots is reduced until the memory condition is met; otherwise the number of slots stays at the value set from the CPU count. Finally, the map and reduce counts are determined from the number of slots. The specific setting procedure is:

The purpose of the present invention is to set the number of slots dynamically for a distributed computing framework. The idea of the strategy is to set the number of slots according to the differing computing power of each node in the Hadoop cluster. The map and reduce counts are set from the CPU and memory of each node. The technical problem lies in relating the number of CPUs in a node to the number of slots in a reasonable way, and in using the memory limit to constrain the number of slots so that it matches the processing power of the nodes in the cluster and tasks run more efficiently.

To relate the number of CPUs in a node to the number of slots, the CPU count of each node is collected and the number of slots is set to the number of CPU cores in the node. Because each core can process one Task on its own without waiting, map Tasks and reduce Tasks execute quickly.

For the memory constraint, the amount of memory to request is determined from the number of slots, and the number of slots is then adjusted according to the node's memory status: if memory is insufficient during the request, the number of slots is reduced until the memory limit is satisfied; otherwise the number of slots stays at the value set from the CPU count.

The invention is described in detail below through a specific example, with reference to Figures 1 and 2.

First, a distributed cluster environment is deployed: a Hadoop cluster of 11 nodes, one of which serves as the master and the other ten as slaves. Ten of the nodes use Xeon E5-2620 @ 2.00 GHz CPUs, with 24 cores, 96 GB of memory, 12 × 2 TB hard disks, and CentOS 6.3 as the operating system. The remaining node uses Xeon E7-8837 @ 2.67 GHz CPUs, with 128 cores, 500 GB of memory, 5 × 2 TB hard disks, and CentOS 6.3 as the operating system. The Hadoop components are installed on CentOS 6.3 according to the official documentation, and the HDFS and MapReduce services are then started.

The flow of running a job is shown in Figure 1. First, the input file or directory for MapReduce must exist on the file system; if MapReduce relies on HDFS, local files must first be uploaded to HDFS. The client requests a job ID from the JobTracker to serve as the job's identifier. MapReduce then copies the resource files needed to run the job to HDFS. Next comes the job submission process proper, in which the input files are divided into data splits (input splits). Splitting determines, before the mapper runs, the range of data each mapper processes; the number of splits determines the number of map tasks, in one-to-one correspondence. A split is only a logical partition: it records which block it should access, together with the starting offset within that block and the data length. The job is then initialized, and the JobTracker is responsible for distributing tasks to the TaskTrackers. While running, each TaskTracker periodically sends heartbeat requests to the JobTracker, reporting its own status and the execution status of its tasks and asking the JobTracker for tasks it can run. The number of maps and reduces actually running on a TaskTracker node is determined by its number of map slots and reduce slots. Determining the number of map slots and reduce slots on each node according to its computing power is therefore very important, and directly affects how efficiently tasks run.

The flow of setting the number of slots is shown in Figure 2. First, the number of CPU cores of each node in the cluster is obtained; the number of map slots is initially set to the number of CPU cores in the node, and the number of reduce slots is initially set to half the number of CPU cores in the node. Then the free memory size of each node is obtained. In the class method initializeMemoryManagement(), the amount of memory to request is determined from the number of slots: the total memory allocation for tasks equals the number of map slots multiplied by the memory size of a single map slot in the TaskTracker, plus the number of reduce slots multiplied by the memory size of a single reduce slot in the TaskTracker. If the total memory allocation is smaller than the free memory of the node, the number of map slots is set to the number of CPU cores in the node and the number of reduce slots to half the number of map slots. Otherwise, if the free memory of the node is smaller than the total memory allocation, the number of map slots and the number of reduce slots are reduced alternately until the node's memory condition is met, and the map and reduce slot counts are set to the values that satisfy the condition. Next, the two TaskLauncher threads in the class method TaskTracker.initialize() start the map and reduce tasks respectively; the corresponding slot count is passed into each TaskLauncher, which then runs the corresponding Task (a map task or a reduce task). When execution finishes, the occupied resources are released. The method uses the number of CPU cores and the memory size to determine a node's computing power: nodes with many CPU cores and large memory are given larger map and reduce counts, and nodes with fewer CPU cores and relatively little memory are given smaller map and reduce counts. In this cluster, the 10 nodes with Xeon E5-2620 @ 2.00 GHz CPUs, 24 cores, and 96 GB of memory are each set to 24 maps and 12 reduces. The other node, with Xeon E7-8837 @ 2.67 GHz CPUs, 128 cores, and 500 GB of memory, is set to 128 maps and 64 reduces. This configuration gives higher task execution efficiency than configuring the same map and reduce counts on every node, and it also makes better use of resources.
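The settings for the two node types can be checked against the memory condition with a short Python calculation. The per-slot memory sizes are not given in the text, so the values of 1 GB per map slot and 2 GB per reduce slot below are purely hypothetical, and the installed memory is used as a stand-in for the free memory of each node.

```python
# Node types from the example cluster (core counts and memory as stated in the text).
NODES = {
    "Xeon E5-2620 node": {"cores": 24, "memory_gb": 96},
    "Xeon E7-8837 node": {"cores": 128, "memory_gb": 500},
}

MAP_SLOT_GB = 1     # hypothetical memory per map slot
REDUCE_SLOT_GB = 2  # hypothetical memory per reduce slot


def initial_slots(cores):
    """Initial setting: map slots = cores, reduce slots = half the cores."""
    return cores, cores // 2


def required_gb(map_slots, reduce_slots):
    """Total memory allocation for the given slot counts."""
    return map_slots * MAP_SLOT_GB + reduce_slots * REDUCE_SLOT_GB


for name, spec in NODES.items():
    map_slots, reduce_slots = initial_slots(spec["cores"])
    need = required_gb(map_slots, reduce_slots)
    fits = need <= spec["memory_gb"]
    print(f"{name}: map={map_slots} reduce={reduce_slots} need={need} GB fits={fits}")
```

Under these assumed slot sizes both node types satisfy the memory condition with their CPU-based settings, which is consistent with the 24/12 and 128/64 values reported for the example; with larger per-slot sizes the alternating reduction step would come into play.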

The foregoing is only an embodiment of the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for dynamically setting the number of slots in MapReduce, characterized in that its specific setting procedure is:

2. the method for a kind of MapReduce dynamic setting slots quantity according to claim 1, it is characterized in that: when setting slots quantity, need to design two variablees, one is map slot, one is reduce slot: first revise the code in TaskTracker, by map slot quantity initial setting, be the core amounts of CPU in node, reduce slot quantity initial setting is half of core amounts of CPU in node; Then in class methods, according to slots quantity, decide the size of application internal memory, total Memory Allocation size of task equals in map slot quantity and TaskTracker that single map slot memory size is long-pendingly adds in resuce slot quantity and TaskTracker that single reduce slot memory size is to be amassed; If it is little that total Memory Allocation of task is compared with the free memory of respective nodes in cluster, slots is set as to this value; If the free memory of respective nodes is little in total Memory Allocation of task and cluster, reduce map slot quantity or reduce slot quantity, the less slots quantity replacing, until meet internal memory condition in node.