CN103106253A

CN103106253A - Genetic-algorithm-based data balancing method in the MapReduce computation model

Info

Publication number: CN103106253A
Application number: CN2013100159884A
Authority: CN
Inventors: 伍卫国; 樊源泉; 魏伟; 朱霍; 高颜
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2013-01-16
Filing date: 2013-01-16
Publication date: 2013-05-15
Anticipated expiration: 2033-01-16
Also published as: CN103106253B

Abstract

Provided is a genetic-algorithm-based data balancing method in the MapReduce computation model. The method includes: obtaining global Map output information and using a genetic algorithm to perform combinatorial optimization; collecting and encoding the metadata; randomly partitioning the population multiple times, each partition forming a gene and the partitions together forming a genome; calculating the fitness function value over all subsets in each gene; applying a selection operator to the genome on the basis of the evaluated fitness of each gene, using a roulette algorithm to randomly choose a number of high-quality genes from the genome; performing a crossover operation on the chosen genes; performing a mutation operation; after multiple rounds of evolution, choosing the genes to retain according to an elitism strategy; and decoding the retained genes to obtain an optimal combination of the metadata, ensuring that the data quantity processed by each reducer is approximately equal. The method solves the problem of unbalanced input data in the reduce phase, saves computing resources, and reduces computing cost.

Description

A genetic-algorithm-based data balancing method in the MapReduce computation model

Technical field

The invention belongs to the technical field of the MapReduce computation model in computing, and specifically relates to a genetic-algorithm-based data balancing method in the MapReduce computation model.

Background technology

Hadoop is an open-source storage and distributed parallel computing platform with high reliability and good scalability, developed by the Apache organization. It was originally developed as the base platform of the open-source search engine project Nutch, later became independent of the Nutch project, and is now one of the typical open-source cloud computing platforms. The Hadoop core implements a block-based distributed file system (Hadoop Distributed File System, HDFS) and the MapReduce computation model for distributed computing.

The MapReduce computation model is divided into two major processing stages, Map and Reduce. During MapReduce processing, the Map stage converts the input data into <Key, Value> key-value pairs and passes them to the Reduce stage for further processing. Before Reduce receives the key-value pairs output by Map and processes them, the data must also pass through a Shuffle stage. The Shuffle stage mainly shuffles the output data of each Map task and collects, from the output of these Map tasks, the data that must be processed by the same reduce task. Because the collected data may be large, the Shuffle stage can merge the data and store it in the local file system of the node where the reduce task runs, thereby reducing memory usage.

Each Map task divides its output data into as many partitions as there are reduce tasks, and each reduce task collects its corresponding partition data from all Map tasks. All Map output key-value pairs with the same key are assigned to the same reduce task, which guarantees that the final result of each reduce task is computed over the global data.

The characteristics of the Shuffle stage mean that the data volume received by each reduce task in the Reduce stage may be extremely uneven, which causes computation skew in the Reduce stage.

1) Reduce computation skew caused by the user-defined partitioning strategy

When a MapReduce job is submitted, the Map stage divides its output into a number of partitions according to the specified partitioning strategy, establishing the correspondence between Map output and reduce input. A user-defined partitioning strategy places interrelated data in the same partition according to the needs of the application, so that the same reduce task processes it and the final result is correct; however, it may also leave the reduce tasks with uneven amounts of data to process.

When a MapReduce job does not care about the specific partitioning of the data, hash partitioning is usually adopted to partition the Map output quickly: the hash value of the Key determines the partition number of the whole <Key, Value> key-value pair, i.e. partitionNum = hashCode(Key) % REDUCER_NUM. This method is limited by factors such as hash collisions and the limited number of reduce tasks, so a large number of keys may well converge on the same partition, making the data volume on the reduce tasks uneven.
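The hash partitioning rule above can be sketched as follows. This is a minimal illustration, not Hadoop's implementation: `zlib.crc32` stands in for Java's `hashCode`, and the key names and the value of `REDUCER_NUM` are assumptions chosen for the example.

```python
import zlib
from collections import Counter

REDUCER_NUM = 4  # assumed number of reduce tasks


def partition_num(key: str, reducer_num: int = REDUCER_NUM) -> int:
    """partitionNum = hashCode(Key) % REDUCER_NUM, with CRC32 standing in for hashCode."""
    return zlib.crc32(key.encode("utf-8")) % reducer_num


# Hypothetical keys; count how many of them land in each partition.
keys = [f"user-{i}" for i in range(20)]
keys_per_partition = Counter(partition_num(k) for k in keys)
for p in range(REDUCER_NUM):
    print(f"partition {p}: {keys_per_partition.get(p, 0)} keys")
```

The rule looks only at the key, so nothing in it prevents many keys from colliding on one partition.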

2) Reduce computation skew caused by characteristics of the input data

Because the partitioning operation is carried out after each Map task outputs its <Key, Value> key-value pairs, and the partition is often determined from some characteristic of the Key, global statistics on the size of the Value data corresponding to each Key are lacking. Therefore, even if the partitioning strategy keeps the number of keys in each partition roughly balanced, the characteristics of the Map-stage input data may make the Value data volume for some specific keys much larger than that for other keys, so that some reduce tasks have too much data to process. This phenomenon usually appears when the input data contains hot-spot data. In general, input data skew in the Reduce stage makes some reduce tasks run longer than others, extends the running time of the whole Reduce stage, and ultimately delays the completion of the whole MapReduce job.
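The effect of a hot-spot key can be illustrated with a small sketch. The input sizes are made up for the example, `reduce_loads` is a hypothetical helper, and CRC32 is again an assumed stand-in for the hash used by the partitioner.

```python
import zlib


def reduce_loads(value_bytes_by_key, reducer_num):
    """Return (keys per partition, value bytes per partition) under hash partitioning."""
    key_counts = [0] * reducer_num
    volumes = [0] * reducer_num
    for key, nbytes in value_bytes_by_key.items():
        p = zlib.crc32(key.encode("utf-8")) % reducer_num
        key_counts[p] += 1
        volumes[p] += nbytes
    return key_counts, volumes


# Made-up input: eight ordinary keys of 100 bytes each and one hot-spot key.
sample = {f"k{i}": 100 for i in range(8)}
sample["hot"] = 100_000
key_counts, volumes = reduce_loads(sample, reducer_num=3)
print("keys per partition: ", key_counts)
print("bytes per partition:", volumes)
```

Whichever partition receives the hot key carries almost the entire data volume, regardless of how evenly the key counts are spread.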

Summary of the invention

To overcome the above shortcomings of the prior art, the object of the present invention is to provide a genetic-algorithm-based data balancing method in the MapReduce computation model that reduces the processing time of the reducer tasks, and thereby the processing time of the whole MapReduce job, saving computing resources and reducing computing cost.

To achieve the above object, the technical solution adopted by the present invention is:

A genetic-algorithm-based data balancing method in the MapReduce computation model, comprising the following steps:

1) Obtain the global Map output information, i.e. the metadata of the partitions that the reduce tasks process. The Reduce metadata is obtained as follows:

1.1 After each Map task completes processing and writes its output to local disk, its TaskTracker uses a heartbeat message to send a task-completion message to the JobTracker;

1.2 The JobTracker maintains a Map task completion message queue for each MapReduce job; when a TaskTracker running a reduce task requests Map task completion messages, the JobTracker takes messages from the queue of the job to which that reduce task belongs and passes them to the TaskTracker;

1.3 A reduce task in the same job obtains the Map task completion messages from its own TaskTracker and extracts the runtime information of the Map tasks from them, including the Map task number and XM; using this information, the reduce task establishes an HTTP connection with XM and requests the metadata of the Map task output;

1.4 The TaskTracker reads the index file of the corresponding Map task output from the local file system according to the requested Map task number, and sends it to the requesting reduce task;

1.5 The reduce task merges the identically numbered virtual partitions in the different index files and sums the data volume of all <Key, Value> key-value pairs of the same kind in each virtual partition, so that each reduce task obtains the metadata of all map task outputs;

2) Process the Map output data. The reduce task obtains the raw partition data output by each map task; the aggregated metadata is submitted to the repartitioner, which uses a genetic algorithm to balance the metadata. The genetic algorithm operates on bit strings. The specific steps are as follows:

2.1 Collect the metadata of the Map output data into a set, which serves as the population, and encode each element of the population. Encoding means representing each element with a code composed of "0" and "1"; the encoding adopted uses the number of 1s to represent the index of the element in the set. The population is randomly divided into N subsets, where N corresponds to the number of reduce tasks. Each division forms a gene, and repeated divisions form a genome;

2.2 In a genetic algorithm, the fitness function measures how well an individual is adapted to its living environment; individuals with higher fitness obtain more opportunities to reproduce, and vice versa. A fitness function is therefore defined as

$\min \frac{\sum_{j=1}^{n} \left| S_{j} - S \right|}{n}$ Formula (1)

where $S$ is the mean of the element sums of all subsets. The objective function in Formula (1) describes the mean distance of the subsets from this mean value. Using Formula (1), the fitness function of each gene is calculated to form a new set, and the fitness probability of each gene is then obtained, namely the fitness function value of the gene divided by the sum of the fitness function values of the whole genome;

2.3 Apply the selection operator to the genome. The selection operator adopted is roulette wheel selection: a random function generates a random number in [0, 1], and its position in the sequence of fitness probabilities of the genome is determined; if the largest value in the sequence that it exceeds is the m-th value, the m-th gene is selected. The number of genes to select can be specified freely;

2.4 Perform the crossover operation on the selected genes, i.e. exchange and recombine parts of the structure of the high-quality genes to form new genes. A single-point crossover operator is adopted. Specifically: a crossover point is set at random, and the genes chosen by the roulette selection algorithm are crossed, i.e. the parts of the two genes before and after the crossover point are exchanged, generating two new individuals. To ensure that the genome after the exchange contains no empty set, a nullGen flag is set and the genome after crossover is traversed; if an empty set is found, the nullGen flag is set to false, and the flag is used to mark that gene for deletion;

2.5 Perform the mutation operation on the genes after crossover. The mutation operation replaces some genes in the genome with other genes according to the mutation probability, forming a new individual. A fixed-bit mutation operator is adopted, and the mutation probability is set to 0.1 so as to obtain the optimal solution. The fixed-bit mutation operator applies mutation to one or several fixed, designated positions of an individual's gene: a 0 becomes 1 and a 1 becomes 0. After mutation, a non-empty check is performed on the mutated gene to ensure that it still has N subsets;

2.6 The above describes one round of evolution. After multiple rounds of evolution, the genes to retain are selected according to the elitism strategy. The gene retention strategy adopted is: after the above steps, calculate the objective function value of each gene and compare it with the objective function values of all genes in the genome; genes whose value is smaller are retained;

2.7 Decode the retained gene to obtain an optimized combination of the metadata, i.e. the metadata divided into N subsets of roughly equal size; then assign the data of each subset to a corresponding reducer, which ensures that the data volume processed by each reducer is approximately equal.
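Steps 2.1 to 2.3 can be sketched as follows under stated assumptions. A gene is represented here as an assignment vector (element index to subset index) rather than the 0/1 coding described in step 2.1, $S_{j}$ is taken to be the element sum of subset $j$, and the function names and sample sizes are hypothetical. The selection probability follows the text literally (Formula (1) value divided by the genome total); taken literally this favours less balanced genes, and the text does not say whether the value is inverted first.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def random_gene(num_elements, n, rng):
    """Step 2.1: randomly divide the elements into n non-empty subsets."""
    gene = list(range(n)) + [rng.randrange(n) for _ in range(num_elements - n)]
    rng.shuffle(gene)
    return gene  # gene[i] is the subset index of element i


def fitness(gene, sizes, n):
    """Formula (1): mean absolute distance of the subset sums from their mean."""
    sums = [0] * n
    for element, subset in enumerate(gene):
        sums[subset] += sizes[element]
    mean = sum(sums) / n
    return sum(abs(s - mean) for s in sums) / n


def selection_probabilities(genome, sizes, n):
    """Step 2.2: each gene's fitness value divided by the genome-wide sum."""
    values = [fitness(g, sizes, n) for g in genome]
    total = sum(values)
    if total == 0:  # every gene is perfectly balanced
        return [1 / len(genome)] * len(genome)
    return [v / total for v in values]


def roulette_select(probs, rng):
    """Step 2.3: locate a random number in the cumulative probability sequence."""
    cumulative = list(accumulate(probs))
    m = bisect_right(cumulative, rng.random())
    return min(m, len(probs) - 1)  # guard against floating-point round-off


rng = random.Random(0)
sizes = [10, 20, 30, 40, 50, 60]  # made-up data volumes per metadata element
genome = [random_gene(len(sizes), 3, rng) for _ in range(5)]
probs = selection_probabilities(genome, sizes, 3)
print("selected gene index:", roulette_select(probs, rng))
```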

The beneficial effects of the invention are:

A solution is proposed for the Reduce-stage computation skew problem in the MapReduce platform. The method uses a genetic algorithm to repartition the Map-stage output data so that the data volume of each partition is roughly the same. This lets the reduce tasks use system resources more efficiently and avoids the inconsistent processing times caused by unbalanced reducer input volumes, reducing the processing time of the reducer tasks and, in turn, of the whole MapReduce job. From a business perspective, the method saves computing resources and reduces computing cost.

Description of the drawings

Fig. 1 is a flowchart of Reduce metadata acquisition.

Fig. 2 is a class diagram of the Map output metadata acquisition module.

Fig. 3 is a flowchart of the genetic-algorithm-based data balancing method.

Embodiment

The present invention is described in detail below with reference to the accompanying drawings.

1) Obtain the global Map output information, i.e. the metadata of the partitions that the reduce tasks process. The Reduce metadata acquisition process is shown in Fig. 1:

1.5 The reduce task merges the identically numbered virtual partitions in the different index files and sums the data volume of all <Key, Value> key-value pairs of the same kind in each virtual partition, so that each reduce task obtains the metadata of all map task outputs. In practice the number of map tasks is usually large and they are distributed across multiple compute nodes, so to speed up metadata acquisition an implementation can use multithreading for this process. The main class structure of the Map output metadata acquisition module is shown in Fig. 2;

2) Process the Map output data. The reduce task obtains the raw partition data output by each map task; the aggregated metadata is submitted to the repartitioner. To make the input data volume obtained by each reducer roughly the same, the present invention uses a genetic algorithm to balance the metadata. The genetic algorithm operates on bit strings rather than on the data itself. The specific steps are as follows:

2.1 Collect the metadata of the Map output data into a set, which serves as the population, and encode each element of the population. Encoding means representing each element with a code composed of "0" and "1"; the encoding adopted by the present invention uses the number of 1s to represent the index of the element in the set. The population is randomly divided into N subsets, where N corresponds to the number of reduce tasks. Each division forms a gene, and repeated divisions form a genome;

2.2 In a genetic algorithm, the fitness function measures how well an individual is adapted to its living environment; individuals with higher fitness obtain more opportunities to reproduce, and vice versa. The present invention therefore defines a fitness function

$\min \frac{\sum_{j=1}^{n} \left| S_{j} - S \right|}{n}$ Formula (1)

where $S$ is the mean of the element sums of all subsets;

2.3 Apply the selection operator to the genome. The selection operator adopted by the present invention is roulette wheel selection, a commonly used random selection method similar to roulette in gambling. Its main idea is to convert individual fitness proportionally into a selection probability: the disk is divided into sectors in proportion to each individual's share, the disk is spun, and the individual whose sector the pointer rests on when the disk stops is the one selected. The benefit of this selection algorithm is that the larger an individual's probability, the larger the share of the disk it occupies and the greater its chance of being selected. Using this idea, the present invention is implemented as follows: a random function generates a random number in [0, 1], and its position in the sequence of fitness probabilities of the genome is determined; if the largest value in the sequence that it exceeds is the m-th value, the m-th gene is selected. In general, the number of genes to select can be specified freely;

2.4 Perform the crossover operation on the selected genes, i.e. exchange and recombine parts of the structure of the high-quality genes to form new genes. Crossover is an important feature that distinguishes genetic algorithms from other evolutionary algorithms. The present invention adopts a single-point crossover operator. Specifically: a crossover point is set at random, and the genes chosen by the roulette selection algorithm are crossed, i.e. the parts of the two genes before and after the crossover point are exchanged, generating two new individuals. To ensure that the genome after the exchange contains no empty set, a nullGen flag is set and the genome after crossover is traversed; if an empty set is found, the nullGen flag is set to false, and the flag is used to mark that gene for deletion;

2.5 Perform the mutation operation on the genes after crossover. The mutation operation replaces some genes in the genome with other genes according to the mutation probability, forming a new individual. A genetic algorithm introduces mutation for two purposes. The first is to give the algorithm a local random search capability: when the crossover operator has brought the algorithm close to the neighbourhood of the optimal solution, the local random search of the mutation operator can accelerate convergence to the optimum. In this case the mutation probability should clearly take a small value, otherwise the building blocks close to the optimal solution would be destroyed by mutation. The second is to let the algorithm maintain population diversity and so prevent premature convergence; in this case the convergence probability should take a larger value. Based on these considerations, the present invention adopts a fixed-bit mutation operator and sets the mutation probability to 0.1 so as to obtain the optimal solution. The fixed-bit mutation operator applies mutation to one or several fixed, designated positions of an individual's gene: a 0 becomes 1 and a 1 becomes 0. After mutation, a non-empty check is performed on the mutated gene to ensure that it still has N subsets;

2.6 The above describes one round of evolution. After multiple rounds of evolution, the genes to retain are selected according to the elitism strategy. The gene retention strategy adopted by the present invention is: after the above steps, calculate the objective function value of each gene and compare it with the objective function values of all genes in the genome; genes whose value is smaller are retained;

2.7 Decode the retained gene to obtain an optimized combination of the metadata, i.e. the metadata divided into N subsets of roughly equal size; then assign the data of each subset to a corresponding reducer. This ensures that the data volume processed by each reducer is comparable and solves the problem of input data skew in the reduce stage well. The flowchart of the genetic-algorithm-based data balancing method in the MapReduce computation model is shown in Fig. 3.
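One round-by-round reading of steps 2.4 to 2.7 can be sketched as below. It is an illustration under several assumptions, not the patented implementation: genes are assignment vectors (element index to subset index) instead of 0/1 strings, mutation reassigns an element to a random subset instead of flipping fixed bits, parents are drawn uniformly instead of by roulette, and elitism is read as keeping the genes with the smallest objective values. All function names and parameters other than the 0.1 mutation probability are hypothetical.

```python
import random


def objective(gene, sizes, n):
    """Formula (1): mean absolute distance of the subset sums from their mean."""
    sums = [0] * n
    for element, subset in enumerate(gene):
        sums[subset] += sizes[element]
    mean = sum(sums) / n
    return sum(abs(s - mean) for s in sums) / n


def has_empty_subset(gene, n):
    """Non-empty check: a valid gene must still use all n subsets."""
    return len(set(gene)) < n


def crossover(a, b, rng):
    """Step 2.4: single-point crossover, swapping the tails of two genes."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(gene, n, rate, rng):
    """Step 2.5 (simplified): reassign each element with probability `rate`."""
    return [rng.randrange(n) if rng.random() < rate else s for s in gene]


def balance(sizes, n, pop_size=30, rounds=200, rate=0.1, seed=0):
    """Evolve assignments of `sizes` to n subsets; needs len(sizes) >= max(n, 2)."""
    rng = random.Random(seed)

    def random_gene():
        gene = list(range(n)) + [rng.randrange(n) for _ in range(len(sizes) - n)]
        rng.shuffle(gene)
        return gene

    genome = [random_gene() for _ in range(pop_size)]
    for _ in range(rounds):
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(genome, 2)  # simplification: uniform parent choice
            for child in crossover(a, b, rng):
                child = mutate(child, n, rate, rng)
                if not has_empty_subset(child, n):  # drop genes with an empty subset
                    children.append(child)
        # Step 2.6 (one reading of elitism): keep the smallest objective values.
        genome = sorted(genome + children, key=lambda g: objective(g, sizes, n))[:pop_size]
    # Step 2.7: decode the best gene into n subsets of element indices.
    subsets = [[] for _ in range(n)]
    for element, subset in enumerate(genome[0]):
        subsets[subset].append(element)
    return subsets


sizes = [50, 40, 30, 20, 10, 10, 20, 30, 40, 50]  # made-up data volumes
for j, subset in enumerate(balance(sizes, n=2)):
    print(f"reducer {j}: elements {subset}, volume {sum(sizes[i] for i in subset)}")
```

Each decoded subset corresponds to the input of one reducer; the seed is fixed only so that repeated runs of the sketch give the same result.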

Claims

A genetic-algorithm-based data balancing method in the MapReduce computation model, characterized in that it comprises the following steps:

1) Obtain the global Map output information, i.e. the metadata of the partitions that the reduce tasks process. The Reduce metadata is obtained as follows:

1.1 After each Map task completes processing and writes its output to local disk, its TaskTracker uses a heartbeat message to send a task-completion message to the JobTracker;

1.2 The JobTracker maintains a Map task completion message queue for each MapReduce job; when a TaskTracker running a reduce task requests Map task completion messages, the JobTracker takes messages from the queue of the job to which that reduce task belongs and passes them to the TaskTracker;

1.3 A reduce task in the same job obtains the Map task completion messages from its own TaskTracker and extracts the runtime information of the Map tasks from them, including the Map task number and XM; using this information, the reduce task establishes an HTTP connection with XM and requests the metadata of the Map task output;

1.4 The TaskTracker reads the index file of the corresponding Map task output from the local file system according to the requested Map task number, and sends it to the requesting reduce task;

1.5 The reduce task merges the identically numbered virtual partitions in the different index files and sums the data volume of all <Key, Value> key-value pairs of the same kind in each virtual partition, so that each reduce task obtains the metadata of all map task outputs;

2) Process the Map output data. The reduce task obtains the raw partition data output by each map task; the aggregated metadata is submitted to the repartitioner, which uses a genetic algorithm to balance the metadata. The genetic algorithm operates on bit strings. The specific steps are as follows:

2.1 Collect the metadata of the Map output data into a set, which serves as the population, and encode each element of the population. Encoding means representing each element with a code composed of "0" and "1"; the encoding adopted by the present invention uses the number of 1s to represent the index of the element in the set. The population is randomly divided into N subsets, where N corresponds to the number of reduce tasks. Each division forms a gene, and repeated divisions form a genome;

2.2 In a genetic algorithm, the fitness function measures how well an individual is adapted to its living environment; individuals with higher fitness obtain more opportunities to reproduce, and vice versa. A fitness function is therefore defined as

$\min \frac{\sum_{j=1}^{n} \left| S_{j} - S \right|}{n}$ Formula (1)

where $S$ is the mean of the element sums of all subsets. The objective function in Formula (1) describes the mean distance of the subsets from this mean value. Using Formula (1), the fitness function of each gene is calculated to form a new set, and the fitness probability of each gene is then obtained, namely the fitness function value of the gene divided by the sum of the fitness function values of the whole genome;

2.3 Apply the selection operator to the genome. The selection operator adopted is roulette wheel selection: a random function generates a random number in [0, 1], and its position in the sequence of fitness probabilities of the genome is determined; if the largest value in the sequence that it exceeds is the m-th value, the m-th gene is selected. The number of genes to select can be specified freely;

2.4 Perform the crossover operation on the selected genes, i.e. exchange and recombine parts of the structure of the high-quality genes to form new genes. A single-point crossover operator is adopted. Specifically: a crossover point is set at random, and the genes chosen by the roulette selection algorithm are crossed, i.e. the parts of the two genes before and after the crossover point are exchanged, generating two new individuals. To ensure that the genome after the exchange contains no empty set, a nullGen flag is set and the genome after crossover is traversed; if an empty set is found, the nullGen flag is set to false, and the flag is used to mark that gene for deletion;

2.5 Perform the mutation operation on the genes after crossover. The mutation operation replaces some genes in the genome with other genes according to the mutation probability, forming a new individual. A fixed-bit mutation operator is adopted, and the mutation probability is set to 0.1 so as to obtain the optimal solution. The fixed-bit mutation operator applies mutation to one or several fixed, designated positions of an individual's gene: a 0 becomes 1 and a 1 becomes 0. After mutation, a non-empty check is performed on the mutated gene to ensure that it still has N subsets;

2.6 The above describes one round of evolution. After multiple rounds of evolution, the genes to retain are selected according to the elitism strategy. The gene retention strategy adopted is: after the above steps, calculate the objective function value of each gene and compare it with the objective function values of all genes in the genome; genes whose value is smaller are retained;

2.7 Decode the retained gene to obtain an optimized combination of the metadata, i.e. the metadata divided into N subsets of roughly equal size; then assign the data of each subset to a corresponding reducer, which ensures that the data volume processed by each reducer is comparable.