CN102937918A - Data block balancing method in operation process of HDFS (Hadoop Distributed File System) - Google Patents

Data block balancing method in operation process of HDFS (Hadoop Distributed File System)

Info

Publication number
CN102937918A
CN102937918A (application CN201210393176A)
Authority
CN
China
Prior art keywords
node
task
data block
request
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012103931769A
Other languages
Chinese (zh)
Other versions
CN102937918B (en)
Inventor
曹海军
伍卫国
董小社
樊源泉
魏伟
朱霍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201210393176.9A
Publication of CN102937918A
Application granted
Publication of CN102937918B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data block balancing method applied while an HDFS (Hadoop Distributed File System) is running. The method comprises the following steps: first, the local task list of each node is preprocessed and divided into fully local tasks and non-fully local tasks, providing the basis for deciding when HDFS data block balancing should start; second, the processing rate of each node is estimated and its task requests are predicted; third, on this basis the task assignment of each node is analysed; fourth, suitable node pairs are selected and data blocks are moved between them so that the block distribution matches the predicted sequence of node task requests; finally, the data blocks are balanced. By predicting node task requests in advance, the method identifies non-local map task executions that may occur and moves suitable data blocks between the corresponding nodes, so that a node receives a local map task assignment when it issues its actual task request, thereby improving the completion efficiency of the Map phase.

Description

Data block balancing method in the operation process of an HDFS
Technical field
The invention belongs to the field of computer technology and relates to a data block balancing method, in particular to a method for balancing data blocks while an HDFS (Hadoop Distributed File System) is running in a cloud computing environment.
Background technology
Hadoop is a highly reliable and highly scalable distributed storage and parallel computing platform developed by the Apache open-source community. It was originally developed as the base platform of the open-source search engine project Nutch, was later split out of the Nutch project, and has become one of the typical open-source cloud computing platforms. The Hadoop core implements a block-based distributed file system (Hadoop Distributed File System, HDFS) and the MapReduce computation model for distributed computing. HDFS provides a storage system composed of numerous nodes for the Hadoop cluster: when a large data file is stored, it is cut into data blocks of equal size (the last block being the exception) that are distributed over all nodes of the cluster. To guarantee reliability, HDFS creates several replicas of each data block, as configured, and places them on different cluster nodes. HDFS provides the data storage service for the upper-layer MapReduce computing engine. Hadoop MapReduce splits an application into many small tasks executed in parallel, each of which processes a data block stored locally on its compute node.
The HDFS file system stores data sets in a distributed manner using a block mechanism and improves system reliability through a block replication strategy: each data block has several replicas in the system at the same time, distributed over several nodes in several racks, so that the failure of a single node does not cause the loss of a data block. In addition, this distributed redundancy scheme allows concurrent reads of a file, which makes HDFS well suited to the "write once, read many" data processing pattern. To implement this replication strategy, the HDFS file system must write all replicas at the same time when data are written.
When writing a data stream, the HDFS file system first obtains several nodes from the NameNode to form a pipeline. When the data stream reaches the first node in the pipeline, that node stores the data and forwards it to the second node in the pipeline; likewise, the second node stores the data and forwards it to the third node, and so on, until all replicas have been written.
When placing a data block and its replicas, the HDFS file system considers the following points:
1) when the node submitting the data is itself a storage node of the HDFS file system, one replica of the data block is placed on that node;
2) the replicas of a data block must be distributed over several racks, so that the failure of a single rack does not make all of the data unavailable;
3) a replica of the data block must also be placed on another node in the same rack as the submitting data node, which reduces inter-rack communication and I/O overhead as far as possible;
4) provided the preceding conditions are met, the storage utilization of the nodes is also taken into account, keeping the storage utilization of each node as balanced as possible.
The Hadoop Map phase is the first phase of the execution of a MapReduce job. It mainly converts the external input data into intermediate data in <Key, Value> form, which is provided as input to the subsequent Reduce phase. In a distributed parallel processing environment, the Map phase uses the distributed file system HDFS as its input data source and, following the guiding principle that "moving computation is cheaper than moving data", the Map processing procedure specified by the user at job submission is dispatched to the nodes that store the HDFS data blocks. When the input data required by the processing procedure assigned to a node happen to be stored on that node, the procedure is said to satisfy data locality.
Hadoop MapReduce avoids processing several replicas of the same data block repeatedly through its node task request assignment mechanism. However, analysis of the Map phase execution shows that the locality of the input data of a Map task also has a large impact on its execution speed. When a Map task runs on the node that stores its input data, the network transmission overhead for the data block is saved and the execution speed of the Map task is improved. In the existing Hadoop architecture, the placement of HDFS data block replicas directly affects, through the Hadoop task scheduler, the locality of Map task input data.
Therefore, although the existing HDFS data block placement strategy keeps the number of data blocks on each node roughly balanced, the unreasonable placement of some block replicas means that after one node "steals" the local Map tasks of other nodes, those other nodes in turn need to "steal" tasks because their own local Map tasks have already been assigned. This "task allocation offset" phenomenon further increases the amount of non-local data transferred during the Map phase, puts enormous transmission pressure on the whole network and affects the operating efficiency of every phase. In addition, even when the number of data blocks per node is balanced, differences in task processing speed between nodes can also lead to a large amount of non-local task processing.
Summary of the invention
The object of the invention is to solve the problem of low data locality of map tasks in the Map phase caused by the uneven distribution of HDFS data blocks, and to provide a data block balancing method for a running HDFS. The method proposes an HDFS balancing strategy based on moving data blocks at run time: by predicting node task requests, it determines in advance the non-local map task executions that may occur and moves suitable data blocks between the corresponding nodes, so that a node receives a local map task assignment when it issues its actual task request, thereby improving the completion efficiency of the Map phase.
The objective of the invention is achieved through the following technical solution:
The data block balancing method for a running HDFS comprises the following steps:
1) preprocessing of the node-local task lists
1.1 distinguishing fully local tasks from non-fully local tasks: because each HDFS data block has several replicas, the same task can appear in the local Map task lists of different nodes, so the fact that n map tasks remain in the local task list of a node does not mean that the node can be assigned n local tasks to execute;
1.2 preprocessing of the node-local task lists: as each node issues its task requests in turn, the currently executable tasks are taken from the node's local task list and added to the node's fully local task list, while the tasks in the local task list that cannot be assigned are added to the non-fully local task list;
2) statistics of node runtime information
This is realized by designing a NodeEvaluateInfo class, which records for each node the total number sum of data blocks processed, the total time cost spent processing those blocks, and the execution progress tip of the running task; from this information the average block processing time cost/sum of the node and the remaining time (1-tip)/(cost/sum) of the node's current task can be computed;
3) node speed assessment and task request sequence prediction
3.1 node speed assessment: using the statistics of step 2), the data processing rate of each node is represented by COST_i/NUM_i, i.e. the average time the node takes to process a single task, where NUM_i is the number of local map tasks completed by node i at a given moment and COST_i is the total time spent processing those local tasks;
3.2 system task request sequence prediction: the system task request sequence is the sequence, from the current time until the job finishes, of the moments at which each node applies to the master node for task execution; at time T_0 the progress of the task being processed by node i is P_i, and the average time the node takes to process a single data block, obtained from the preceding speed assessment formula, is T_i; the time point t_ik of the k-th task request of this node is then
t_ik = T_0 + (1-P_i)×T_i + (k-1)×T_i,   k ≥ 1;
where k counts the task requests of this node starting from the current time; after the task request sequence of each node has been obtained, the system task request sequence is determined as follows: let m be the number of outstanding tasks in the system and n the number of nodes in the system; for each node i, take the time points of its next m task requests counted from the current time, denoted {t_i1, t_i2, ..., t_im}; the n nodes thus yield n×m time points {t_11, t_12, ..., t_1m, t_21, t_22, ..., t_2m, ..., t_n1, t_n2, ..., t_nm}; sorting all time points in ascending order and taking the first m gives the request sequence R_m of the m tasks remaining in the system from the current time; R_m(j) = t_ik means that the j-th task request in the system will be issued by node i at time t_ik and that this request is the k-th request of node i;
4) node task assignment analysis and realization: under the node request sequence predicted in step 3), the task assignment of each node is determined in advance;
5) selection of data block movement node pairs: the node issuing a request is obtained from the task request sequence, and a task is then obtained from the local task list of that node; if no task is found, the node is identified as a node to be balanced and added to the list of nodes to be balanced; the first step of the node pair selection process is to traverse the allocate array and build a mapping table Map<Node, List<Task>> recording all unassigned tasks on all data block source nodes;
6) movement of data blocks between nodes
The actual data block movement can be carried out only after the node to be balanced and the data block source node have been determined; because data block movement is considered independently of node task execution, several data blocks may need to be moved; to improve efficiency and simplify the implementation, the whole data block movement is realized with the Java thread pool technique.
Further, in the above step 4), the response process of the Hadoop scheduler under the currently predicted system task request sequence R_m is simulated; according to the request time of each node and the current task assignment state of the system, the task assignment response to each request is determined and it is judged whether the assignment satisfies task locality; the task assignment record is realized by an AllocatedRecord class, which records for each task the time of assignment, the number of the node it is assigned to, whether it was assigned, and whether the data block corresponding to the task has been added to the exchange list; the node task request record stores the node issuing the request and the position of the request among all requests of that node counted from the current time; finally, according to the system task request sequence R_m determined in step 3), the j-th request R_m(j) = t_ik is issued by node i at time t_ik and is the k-th request of node i; the task request sequence R_m is traversed, and for the j-th task request occurring in the system the search for the first schedulable local Map task starts from the k-th task task_(i,k) of the local Map task list of node i; a local Map task is judged schedulable according to the following criteria:
(1) task_(i,k) is not empty;
(2) allocate[task_(i,k).id] == -1, i.e. the task has not been assigned by the scheduler to another node;
when task_(i,k+m) (m ≥ 0) satisfies task locality, the assignment record of the corresponding task is set, allocate[task_(i,k).id] = i, and the analysis of the j-th task request is finished; otherwise, when task_(i,k+m) (m ≥ 0) is not empty, task_(i,k+m) is added to the exchangeable task queue of this node and the next local Map task task_(i,k+m+1) (m ≥ 0) is examined; when task_(i,k+m) (m ≥ 0) is empty, a BalanceNode object for the node to be balanced is built from node i and its exchange task queue and recorded in the list of nodes to be balanced.
Further, the detailed process of the above step 5) is: for each unassigned task task, the set of nodes storing replicas of its data block is obtained, and the nodes of this set and the task are put into the mapping table in <Node, List<Task>> form, the task being appended to the tail of the List<Task> of a node already present; after all data block source nodes have been obtained, for each node to be balanced in the list of nodes to be balanced, the mapping table is traversed to find the first data block source node located in the same rack as the node to be balanced, and a data block movement request is constructed and submitted; two nodes are judged to be in the same rack if their node name prefixes are identical; when no data block source node in the same rack as the node to be balanced can be found, the first data block source node in the mapping table is selected.
The invention has the following beneficial effects:
The present invention addresses the differences between nodes in processing data blocks during the Map phase of a Hadoop job. By moving data blocks so that the block distribution better matches the performance of each node, it not only reduces the non-local Map task assignments of subsequent similar jobs running on the same data set, improves Map task locality and balances the Map-phase task execution across nodes, but also improves the task balance of the currently running job in the remainder of its Map phase. In the Hadoop Map phase the Map tasks execute completely independently of one another: when processing a local Map task, each Map task only reads data from the local disk and needs almost no network communication apart from reporting its own progress to the JobTracker node. Therefore, moving data blocks while nodes are processing local Map tasks puts little pressure on the network as a whole.
Description of drawings
Fig. 1 is a class diagram of the node task assignment analysis process;
Fig. 2 is a flowchart of the node task assignment analysis;
Fig. 3 illustrates the matching of nodes to be balanced with unassigned tasks;
Fig. 4 is a flowchart of the data block movement node pair matching;
Fig. 5 shows the thread pool framework for data block movement;
Fig. 6 is a class diagram of the data block movement thread pool;
Fig. 7 illustrates the movement of data blocks between nodes.
Embodiment
The present invention is described in detail below with reference to the accompanying drawings.
The specific implementation steps of the HDFS data block balancing strategy based on runtime data block movement are as follows:
Step 1: preprocessing of the node-local task lists. The local task list of each node is preprocessed and divided into a fully local part and a non-fully local part. The fully local task parts of all nodes together cover the input data set exactly once and have no tasks in common. Ideally, if the fully local tasks were assigned to the nodes simultaneously, the HDFS data block distribution would match the scheduler's assignment to the nodes, i.e. the HDFS data block placement would be balanced. Conflicting task assignments can therefore be determined by predicting the future task requests of the nodes, and from them the non-local task assignments that may occur can be judged, which provides a reliable basis for balancing the HDFS data block placement. The preprocessing of the node-local task lists simulates the task assignment process of the JobTracker. Assuming every node processes tasks at the same speed and issues task requests in turn, the preprocessing responds to each node's request: the currently executable tasks are taken from the node's local task list and added to the node's fully local task list, while the tasks in the local task list that cannot be assigned are added to the non-fully local task list. An executable local task is a task that has not already been executed on another node storing a replica of its block.
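The splitting can be illustrated with a minimal Java sketch. It assumes a simplified representation in which tasks are integer ids and each node's local list is just the list of task ids whose input blocks it stores; the round-robin simulation of equal-speed requests follows the description above, and all class and method names are illustrative assumptions, not the patent's actual implementation.

import java.util.*;

// Minimal sketch of the local task list preprocessing described above.
public class LocalListPreprocessor {

    /** Result: per node, a fully local list and a non-fully local list. */
    public static class SplitLists {
        public final Map<String, List<Integer>> fullyLocal = new HashMap<>();
        public final Map<String, List<Integer>> nonFullyLocal = new HashMap<>();
    }

    /**
     * localLists maps each node name to the ids of tasks whose input blocks it stores.
     * Nodes are assumed to request tasks in turn at equal speed, as in the text.
     */
    public static SplitLists split(Map<String, List<Integer>> localLists) {
        SplitLists result = new SplitLists();
        Set<Integer> assigned = new HashSet<>();          // tasks already handed out
        Map<String, Deque<Integer>> pending = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : localLists.entrySet()) {
            pending.put(e.getKey(), new ArrayDeque<>(e.getValue()));
            result.fullyLocal.put(e.getKey(), new ArrayList<>());
            result.nonFullyLocal.put(e.getKey(), new ArrayList<>());
        }
        boolean progress = true;
        while (progress) {                                // round-robin simulation
            progress = false;
            for (String node : localLists.keySet()) {
                Deque<Integer> queue = pending.get(node);
                // tasks already taken by another replica holder are not assignable here
                while (!queue.isEmpty() && assigned.contains(queue.peek())) {
                    result.nonFullyLocal.get(node).add(queue.poll());
                }
                if (!queue.isEmpty()) {                   // executable local task found
                    int task = queue.poll();
                    assigned.add(task);
                    result.fullyLocal.get(node).add(task);
                    progress = true;
                }
            }
        }
        return result;
    }
}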
Step 2: statistics of node runtime information. In the system design, the present invention uses a NodeEvaluateInfo class, which records for each node the total number sum of data blocks processed, the total time cost spent processing those blocks, and the execution progress tip of the running task; from this information the average block processing time cost/sum of the node and the remaining time (1-tip)/(cost/sum) of the node's current task can be computed.
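A minimal sketch of such a statistics holder is given below; the field and method names are assumptions, and the remaining-time estimate is written in the (1-P_i)×T_i form used by the request-time prediction of step 3.

// Minimal sketch of a NodeEvaluateInfo-style statistics holder (names are
// illustrative assumptions, not the patent's actual class).
public class NodeEvaluateInfo {
    private long sum;     // total number of data blocks the node has processed
    private long cost;    // total time (ms) spent processing those blocks
    private double tip;   // progress (0..1) of the task currently being executed

    public void recordFinishedBlock(long elapsedMillis) {
        sum++;
        cost += elapsedMillis;
    }

    public void updateProgress(double currentTip) {
        this.tip = currentTip;
    }

    /** Average per-block processing time cost/sum. */
    public double averageBlockTime() {
        return sum == 0 ? 0.0 : (double) cost / sum;
    }

    /**
     * Estimated remaining time of the current task, written here as
     * (1 - tip) x (cost/sum), matching the (1 - P_i) x T_i term used in the
     * request-time prediction formula of step 3.
     */
    public double remainingTimeOfCurrentTask() {
        return (1.0 - tip) * averageBlockTime();
    }
}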
Step 3: node speed assessment and task request sequence prediction. 1) Node speed assessment. The present invention proposes a node speed assessment scheme that uses the node statistics collected above: the data processing rate of each node is represented by COST_i/NUM_i, i.e. the average time the node takes to process a single task, where NUM_i is the number of local map tasks completed by node i at a given moment and COST_i is the total time spent processing those local tasks. 2) System task request sequence prediction. The system task request sequence is the sequence, from the current time until the job finishes, of the moments at which each node applies to the master node for task execution. In theory this sequence is only known exactly after the job has finished, but it can be predicted during job execution from the data processing speed of each node and the progress of the task it is currently processing. Suppose that at time T_0 the progress of the task being processed by node i is P_i, and that the average time the node takes to process a single data block, obtained from the speed assessment formula above, is T_i; the time point t_ik of the k-th task request of this node is then
t_ik = T_0 + (1-P_i)×T_i + (k-1)×T_i,   k ≥ 1
where k counts the task requests of this node starting from the current time.
After the task request sequence of each node has been obtained, the system task request sequence can be determined as follows.
Let m be the number of outstanding tasks in the system and n the number of nodes in the system. For each node i, take the time points of its next m task requests counted from the current time, denoted {t_i1, t_i2, ..., t_im}; the n nodes thus yield n×m time points {t_11, t_12, ..., t_1m, t_21, t_22, ..., t_2m, ..., t_n1, t_n2, ..., t_nm}. Sorting all time points in ascending order and taking the first m gives the request sequence R_m of the m tasks remaining in the system from the current time. R_m(j) = t_ik means that the j-th task request in the system will be issued by node i at time t_ik and that this request is the k-th request of node i.
A formal description of this process is: given integer sequences A = {a_1, a_2, ..., a_n} and B = {b_1, b_2, ..., b_n}, construct the integer set C = {c_i | c_i = a_i + k·b_i, k ≥ 0} and find the first m elements of C in ascending order.
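The prediction and merge can be sketched in Java as follows. The NodeState and Request classes are illustrative assumptions; the sketch simply evaluates t_ik = T_0 + (1-P_i)×T_i + (k-1)×T_i for k = 1..m on every node, sorts the n×m candidate time points in ascending order and keeps the first m, as described above.

import java.util.*;

// Sketch of the request-time prediction and merge described above.
public class RequestSequencePredictor {

    /** Per-node inputs to the prediction (names are assumptions). */
    public static class NodeState {
        final int nodeId;
        final double progress;   // P_i: progress of the task now running
        final double avgTime;    // T_i: average time per data block
        public NodeState(int nodeId, double progress, double avgTime) {
            this.nodeId = nodeId; this.progress = progress; this.avgTime = avgTime;
        }
    }

    /** One predicted request: issued by nodeId at time t, its k-th request. */
    public static class Request {
        public final int nodeId, k;
        public final double time;
        Request(int nodeId, int k, double time) {
            this.nodeId = nodeId; this.k = k; this.time = time;
        }
    }

    /** Returns R_m: the m earliest predicted requests from time t0 onwards. */
    public static List<Request> predict(double t0, List<NodeState> nodes, int m) {
        List<Request> candidates = new ArrayList<>();
        for (NodeState s : nodes) {
            for (int k = 1; k <= m; k++) {
                // t_ik = T_0 + (1 - P_i) * T_i + (k - 1) * T_i
                double t = t0 + (1.0 - s.progress) * s.avgTime + (k - 1) * s.avgTime;
                candidates.add(new Request(s.nodeId, k, t));
            }
        }
        candidates.sort(Comparator.comparingDouble(r -> r.time));   // ascending order
        return candidates.subList(0, Math.min(m, candidates.size()));
    }
}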
Step 4: node task assignment analysis.
1) Design of the node task assignment process. The node task assignment analysis essentially simulates the response process of the Hadoop scheduler under the currently predicted system task request sequence R_m. According to the request time of each node and the current task assignment state of the system, the task assignment response to each request is determined and it is judged whether the assignment satisfies task locality. When the task request of a node cannot be answered with a local Map task, that node is a node to be balanced. The classes involved in node task assignment are shown in Fig. 1. The node task request record NodeRequest: since the node task assignment analysis simulates the response of the Hadoop task scheduler under the currently predicted system task request sequence, NodeRequest describes one node task request in the simulation, recording the node that issues the request and the position of the request among all requests of that node counted from the current time.
2) Realization of the node task assignment process. The node task assignment analysis determines in advance the task assignment of each node under the predicted node request sequence; its detailed flow is shown in Fig. 2. As described above, the j-th request R_m(j) = t_ik of the system task request sequence R_m is issued by node i at time t_ik and is the k-th request of node i. The task request sequence R_m is traversed, and for the j-th task request occurring in the system the search for the first schedulable local Map task starts from the k-th task task_(i,k) of the local Map task list of node i. A local Map task is judged schedulable according to the following criteria:
(1) task_(i,k) is not empty;
(2) allocate[task_(i,k).id] == -1, i.e. the task has not been assigned by the scheduler to another node. When task_(i,k+m) (m ≥ 0) satisfies task locality, the assignment record of the corresponding task is set, allocate[task_(i,k).id] = i, and the analysis of the j-th task request is finished. Otherwise, when task_(i,k+m) (m ≥ 0) is not empty, task_(i,k+m) is added to the exchangeable task queue of this node and the next local Map task task_(i,k+m+1) (m ≥ 0) is examined; when task_(i,k+m) (m ≥ 0) is empty, a BalanceNode object for the node to be balanced is built from node i and its exchange task queue and recorded in the list of nodes to be balanced. After the analysis of all task requests has been completed, if the HDFS data block placement is unbalanced, the list of nodes to be balanced is not empty and some task assignment records in the allocate array are still -1. At this point the number of nodes to be balanced equals the number of unassigned entries in the allocate array, and an unassigned task is not a local Map task of any node to be balanced, since otherwise some node would have been assigned this task during the preceding node task assignment process.
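A simplified sketch of this simulated assignment loop is given below. It reuses the Request class from the sketch in step 3, represents the allocate array as an int array with -1 meaning "unassigned", and abstracts the locality test into a predicate; these simplifications are assumptions, not the patent's actual data structures.

import java.util.*;

// Sketch of the simulated assignment analysis over R_m.  allocate[taskId] == -1
// means "not yet assigned"; localLists.get(i) is node i's local Map task list.
public class AssignmentAnalysis {

    /** Abstracted locality test (an assumption in this sketch). */
    public interface LocalityTest {
        boolean satisfiesLocality(int nodeId, int taskId);
    }

    /** A node whose request could not be answered locally, with its exchangeable tasks. */
    public static class BalanceNode {
        public final int nodeId;
        public final List<Integer> exchangeableTasks;
        BalanceNode(int nodeId, List<Integer> tasks) {
            this.nodeId = nodeId; this.exchangeableTasks = tasks;
        }
    }

    public static List<BalanceNode> analyze(List<RequestSequencePredictor.Request> rm,
                                            Map<Integer, List<Integer>> localLists,
                                            int[] allocate,
                                            LocalityTest locality) {
        List<BalanceNode> toBalance = new ArrayList<>();
        for (RequestSequencePredictor.Request req : rm) {      // j-th request of the system
            List<Integer> local = localLists.getOrDefault(req.nodeId, Collections.emptyList());
            List<Integer> exchange = new ArrayList<>();
            boolean answered = false;
            // start from task_(i,k) and walk forward through the local list
            for (int idx = req.k - 1; idx < local.size(); idx++) {
                int taskId = local.get(idx);
                if (allocate[taskId] != -1) continue;           // criterion (2) not met: skip
                if (locality.satisfiesLocality(req.nodeId, taskId)) {
                    allocate[taskId] = req.nodeId;              // record the assignment
                    answered = true;
                    break;
                }
                exchange.add(taskId);                           // candidate for the exchange queue
            }
            if (!answered) {                                    // task_(i,k+m) became empty
                toBalance.add(new BalanceNode(req.nodeId, exchange));
            }
        }
        return toBalance;
    }
}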
Step 5: selection of data block movement node pairs. Data blocks can be moved between a node to be balanced and a storage node of the input block of an unassigned task only after the node task assignment analysis has been completed. The process of determining the data block source node and moving the data block is shown in Fig. 3. When matching nodes to be balanced with unassigned tasks, the following constraints apply: (1) the exchange node is selected among the storage nodes of the several replica data blocks of the unassigned task, and to reduce communication overhead a replica storage node in the same rack as the node to be balanced is preferred; (2) transferring data blocks from the same data block storage node to several nodes to be balanced should be avoided as far as possible.
To speed up the matching between nodes, the present invention uses a greedy algorithm: first the data block storage nodes of all unassigned tasks are parsed, then possible matching combinations are searched for between the set of nodes to be balanced and the set of storage nodes; as soon as a node to be balanced and an unassigned-task data block storage node that satisfy the above constraints are found, their matching relationship is fixed and no other, possibly better, matching results are searched for. The time complexity of this algorithm is O(N), where N is the number of nodes to be balanced. The concrete node matching process is shown in Fig. 4.
The first step of the node pair selection process is to traverse the allocate array and build a mapping table Map<Node, List<Task>> that records all unassigned tasks on all data block source nodes. In detail, for each unassigned task task, the set moveableNodes of nodes storing replicas of its data block is obtained, and the nodes in moveableNodes and the task are put into the mapping table nodeToTasks in <Node, List<Task>> form, the task being appended to the tail of the List<Task> of a node already present.
In the implementation, java.util.LinkedHashMap is chosen as the type of nodeToTasks. The basic characteristic of this class is that the key-value pairs in the map are iterated in access order: after a key-value pair has been accessed, it is moved to the tail of the linked list. Using LinkedHashMap effectively avoids moving data blocks from the same data source node to several nodes to be balanced.
After all data block source nodes have been obtained, for each node to be balanced in the list of nodes to be balanced, nodeToTasks is traversed to find the first data block source node located in the same rack as the node to be balanced, and a data block movement request is constructed and submitted. Two nodes are judged to be in the same rack if their node name prefixes are identical; for example, nodes /rack-A/node01 and /rack-A/node02 are in the same rack. When no data block source node in the same rack as the node to be balanced can be found, the first data block source node in nodeToTasks is selected.
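The greedy matching can be sketched as follows. nodeToTasks is assumed to have been built from the allocate array as described above and created as new LinkedHashMap<>(16, 0.75f, true) so that iteration follows access order; node names and the sameRack test based on name prefixes follow the /rack-A/node01 convention above, and everything else is an illustrative assumption.

import java.util.*;

// Sketch of the greedy source-node matching described above.
public class BlockSourceMatcher {

    /** Same-rack test from the text: identical node-name prefixes. */
    static boolean sameRack(String nodeA, String nodeB) {
        String rackA = nodeA.substring(0, nodeA.lastIndexOf('/'));
        String rackB = nodeB.substring(0, nodeB.lastIndexOf('/'));
        return rackA.equals(rackB);
    }

    /**
     * nodeToTasks: for every data block source node, the unassigned tasks whose
     * block replicas it holds.  It is assumed to be an access-ordered
     * LinkedHashMap (new LinkedHashMap<>(16, 0.75f, true)), so that a source
     * node just matched moves to the tail and is less likely to be picked again.
     */
    public static Map<String, String> match(List<String> nodesToBalance,
                                            LinkedHashMap<String, List<Integer>> nodeToTasks) {
        Map<String, String> pairs = new LinkedHashMap<>();      // balance node -> source node
        for (String balanceNode : nodesToBalance) {
            String chosen = null;
            for (String source : nodeToTasks.keySet()) {        // greedy: first acceptable match
                if (sameRack(balanceNode, source)) { chosen = source; break; }
            }
            if (chosen == null && !nodeToTasks.isEmpty()) {     // no same-rack source: take the first entry
                chosen = nodeToTasks.keySet().iterator().next();
            }
            if (chosen != null) {
                nodeToTasks.get(chosen);                        // access moves the entry to the tail
                pairs.put(balanceNode, chosen);
            }
        }
        return pairs;
    }
}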
Step 6: movement of data blocks between nodes. The actual data block movement can be carried out only after the node to be balanced and the data block source node have been determined. Because data block movement is considered independently of node task execution, several data blocks may need to be moved; to improve efficiency and simplify the implementation, the data block movement module is realized with the Java thread pool technique, as shown in Fig. 5. Each data block movement task waits for an idle thread in the pool to execute it; after the task finishes, the thread returns to the pool and accepts the next task. 1) Thread pool design: the classes involved in the thread pool are shown in Fig. 6. 2) Data block movement task: each task corresponds to one data block; according to the task assignment record of the node to be balanced, the corresponding data block can be resolved from the task number and the node with the largest index in the local Map task list is selected as the destination node. Each data block movement request is packaged into a MoveTask class containing the data block Block to be moved, the data block source node, and the destination node BalanceNode. The data block movement task Transfer implements the java.lang.Runnable interface and realizes the data block movement logic in its run() method; its flowchart is shown in Fig. 7.
For each data block movement request, Transfer parses the data block object, the data block source node and the destination node, sends the data block replacement instruction OP_REPLACE_BLOCK to the destination node, and the destination node sends the OP_COPY_BLOCK instruction to the source node to complete the transfer of the data block. After the data block has been copied successfully, the destination node notifies the NameNode to delete the replica of this data block on the source node.
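A minimal sketch of the thread-pool-driven movement is given below, using java.util.concurrent.Executors in place of a hand-written pool. MoveRequest and the body of Transfer.run() are placeholders for the OP_REPLACE_BLOCK / OP_COPY_BLOCK protocol exchange described above; all names are assumptions, not the patent's actual classes.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the thread-pool-driven block movement.
public class BlockMover {

    /** One requested move: which block, from which source node, to which destination node. */
    public static class MoveRequest {
        final String blockId, sourceNode, destinationNode;
        public MoveRequest(String blockId, String sourceNode, String destinationNode) {
            this.blockId = blockId; this.sourceNode = sourceNode; this.destinationNode = destinationNode;
        }
    }

    /** One block move, run by a worker thread of the pool (cf. the Transfer class above). */
    static class Transfer implements Runnable {
        private final MoveRequest request;
        Transfer(MoveRequest request) { this.request = request; }

        @Override
        public void run() {
            // Placeholder for the actual DataNode protocol exchange
            // (OP_REPLACE_BLOCK to the destination, OP_COPY_BLOCK from the source).
            System.out.printf("moving block %s: %s -> %s%n",
                    request.blockId, request.sourceNode, request.destinationNode);
        }
    }

    public static void moveAll(List<MoveRequest> requests, int poolSize) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (MoveRequest r : requests) {
            pool.execute(new Transfer(r));     // idle worker threads pick up the move tasks
        }
        pool.shutdown();                        // stop accepting new tasks
        pool.awaitTermination(1, TimeUnit.HOURS);  // wait for all pending moves to finish
    }
}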

Claims (3)

1. A data block balancing method in the operation process of an HDFS, characterized in that it comprises the following steps:
1) preprocessing of the node-local task lists
1.1 distinguishing fully local tasks from non-fully local tasks: because each HDFS data block has several replicas, the same task can appear in the local Map task lists of different nodes, so the fact that n map tasks remain in the local task list of a node does not mean that the node can be assigned n local tasks to execute;
1.2 preprocessing of the node-local task lists: as each node issues its task requests in turn, the currently executable tasks are taken from the node's local task list and added to the node's fully local task list, while the tasks in the local task list that cannot be assigned are added to the non-fully local task list;
2) statistics of node runtime information
This is realized by designing a NodeEvaluateInfo class, which records for each node the total number sum of data blocks processed, the total time cost spent processing those blocks, and the execution progress tip of the running task; from this information the average block processing time cost/sum of the node and the remaining time (1-tip)/(cost/sum) of the node's current task can be computed;
3) node speed assessment and task request sequence prediction
3.1 node speed assessment: using the statistics of step 2), the data processing rate of each node is represented by COST_i/NUM_i, i.e. the average time the node takes to process a single task, where NUM_i is the number of local map tasks completed by node i at a given moment and COST_i is the total time spent processing those local tasks;
3.2 system task request sequence prediction: the system task request sequence is the sequence, from the current time until the job finishes, of the moments at which each node applies to the master node for task execution; at time T_0 the progress of the task being processed by node i is P_i, and the average time the node takes to process a single data block, obtained from the preceding speed assessment formula, is T_i; the time point t_ik of the k-th task request of this node is then t_ik = T_0 + (1-P_i)×T_i + (k-1)×T_i, k ≥ 1;
where k counts the task requests of this node starting from the current time; after the task request sequence of each node has been obtained, the system task request sequence is determined as follows: let m be the number of outstanding tasks in the system and n the number of nodes in the system; for each node i, take the time points of its next m task requests counted from the current time, denoted {t_i1, t_i2, ..., t_im}; the n nodes thus yield n×m time points {t_11, t_12, ..., t_1m, t_21, t_22, ..., t_2m, ..., t_n1, t_n2, ..., t_nm}; sorting all time points in ascending order and taking the first m gives the request sequence R_m of the m tasks remaining in the system from the current time; R_m(j) = t_ik means that the j-th task request in the system will be issued by node i at time t_ik and that this request is the k-th request of node i;
4) node task assignment analysis and realization: under the node request sequence predicted in step 3), the task assignment of each node is determined in advance;
5) selection of data block movement node pairs: the node issuing a request is obtained from the task request sequence, and a task is then obtained from the local task list of that node; if no task is found, the node is identified as a node to be balanced and added to the list of nodes to be balanced; the first step of the node pair selection process is to traverse the allocate array and build a mapping table Map<Node, List<Task>> recording all unassigned tasks on all data block source nodes;
6) movement of data blocks between nodes
The actual data block movement can be carried out only after the node to be balanced and the data block source node have been determined; because data block movement is considered independently of node task execution, several data blocks may need to be moved; to improve efficiency and simplify the implementation, the whole data block movement is realized with the Java thread pool technique.
2. The data block balancing method in the operation process of an HDFS according to claim 1, characterized in that in step 4), the response process of the Hadoop scheduler under the currently predicted system task request sequence R_m is simulated; according to the request time of each node and the current task assignment state of the system, the task assignment response to each request is determined and it is judged whether the assignment satisfies task locality; the task assignment record is realized by an AllocatedRecord class, which records for each task the time of assignment, the number of the node it is assigned to, whether it was assigned, and whether the data block corresponding to the task has been added to the exchange list; the node task request record stores the node issuing the request and the position of the request among all requests of that node counted from the current time; finally, according to the system task request sequence R_m determined in step 3), the j-th request R_m(j) = t_ik is issued by node i at time t_ik and is the k-th request of node i; the task request sequence R_m is traversed, and for the j-th task request occurring in the system the search for the first schedulable local Map task starts from the k-th task task_(i,k) of the local Map task list of node i; a local Map task is judged schedulable according to the following criteria:
A) task_(i,k) is not empty;
B) allocate[task_(i,k).id] == -1, i.e. the task has not been assigned by the scheduler to another node;
when task_(i,k+m) (m ≥ 0) satisfies task locality, the assignment record of the corresponding task is set, allocate[task_(i,k).id] = i, and the analysis of the j-th task request is finished; otherwise, when task_(i,k+m) (m ≥ 0) is not empty, task_(i,k+m) is added to the exchangeable task queue of this node and the next local Map task task_(i,k+m+1) (m ≥ 0) is examined; when task_(i,k+m) (m ≥ 0) is empty, a BalanceNode object for the node to be balanced is built from node i and its exchange task queue and recorded in the list of nodes to be balanced.
3. The data block balancing method in the operation process of an HDFS according to claim 1 or 2, characterized in that the detailed process of step 5) is: for each unassigned task task, the set of nodes storing replicas of its data block is obtained, and the nodes of this set and the task are put into the mapping table in <Node, List<Task>> form, the task being appended to the tail of the List<Task> of a node already present; after all data block source nodes have been obtained, for each node to be balanced in the list of nodes to be balanced, the mapping table is traversed to find the first data block source node located in the same rack as the node to be balanced, and a data block movement request is constructed and submitted; two nodes are judged to be in the same rack if their node name prefixes are identical; when no data block source node in the same rack as the node to be balanced can be found, the first data block source node in the mapping table is selected.
CN201210393176.9A 2012-10-16 2012-10-16 Data block balancing method in operation process of HDFS Expired - Fee Related CN102937918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210393176.9A CN102937918B (en) 2012-10-16 2012-10-16 Data block balancing method in operation process of HDFS


Publications (2)

Publication Number Publication Date
CN102937918A true CN102937918A (en) 2013-02-20
CN102937918B CN102937918B (en) 2016-03-30

Family

ID=47696817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210393176.9A Expired - Fee Related CN102937918B (en) 2012-10-16 2012-10-16 Data block balancing method in operation process of HDFS

Country Status (1)

Country Link
CN (1) CN102937918B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010013074A1 (en) * 1998-03-17 2001-08-09 Tim P. Marslano System and method for using door translation to perform inter-process communication
US20050234867A1 (en) * 2002-12-18 2005-10-20 Fujitsu Limited Method and apparatus for managing file, computer product, and file system
CN101446966A (en) * 2008-12-31 2009-06-03 中国建设银行股份有限公司 Data storage method and system
CN102142032A (en) * 2011-03-28 2011-08-03 中国人民解放军国防科学技术大学 Method and system for reading and writing data of distributed file system

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530182A (en) * 2013-10-22 2014-01-22 海南大学 Working scheduling method and device
CN103713935A (en) * 2013-12-04 2014-04-09 中国科学院深圳先进技术研究院 Method and device for managing Hadoop cluster resources in online manner
CN103713935B (en) * 2013-12-04 2017-05-03 中国科学院深圳先进技术研究院 Method and device for managing Hadoop cluster resources in online manner
CN105981033B (en) * 2014-02-14 2019-05-07 慧与发展有限责任合伙企业 Placement Strategy is distributed into set of segments
CN105981033A (en) * 2014-02-14 2016-09-28 慧与发展有限责任合伙企业 Assign placement policy to segment set
CN104239520A (en) * 2014-09-17 2014-12-24 西安交通大学 Historical-information-based HDFS (hadoop distributed file system) data block placement strategy
CN104239520B (en) * 2014-09-17 2017-06-20 西安交通大学 A kind of HDFS data block Placement Strategies based on historical information
CN104317650B (en) * 2014-10-10 2018-05-01 北京工业大学 A kind of job scheduling method towards Map/Reduce type mass data processing platforms
CN104317650A (en) * 2014-10-10 2015-01-28 北京工业大学 Map/Reduce type mass data processing platform-orientated job scheduling method
CN105224612B (en) * 2015-09-14 2018-12-07 成都信息工程大学 MapReduce data Localization methodologies based on dynamically labeled preferred value
CN105426495A (en) * 2015-11-24 2016-03-23 中国农业银行股份有限公司 Data parallel reading method and apparatus
CN105426495B (en) * 2015-11-24 2019-03-12 中国农业银行股份有限公司 Data parallel read method and device
CN105578212B (en) * 2015-12-15 2019-02-19 南京邮电大学 A kind of point-to-point Streaming Media method of real-time in big data under stream calculation platform
CN105578212A (en) * 2015-12-15 2016-05-11 南京邮电大学 Point-to-point streaming media real-time monitoring method under big data stream computing platform
CN107872480A (en) * 2016-09-26 2018-04-03 中国电信股份有限公司 Big data cluster data balancing method and apparatus
CN108153759A (en) * 2016-12-05 2018-06-12 中国移动通信集团公司 A kind of data transmission method of distributed data base, middle tier server and system
CN108153759B (en) * 2016-12-05 2021-07-09 中国移动通信集团公司 Data transmission method of distributed database, intermediate layer server and system
CN107122242A (en) * 2017-03-28 2017-09-01 成都优易数据有限公司 A kind of balanced dicing method of big data of effective lifting distributed arithmetic performance
CN107122242B (en) * 2017-03-28 2020-09-11 成都优易数据有限公司 Big data balanced slicing method for effectively improving distributed operation performance

Also Published As

Publication number Publication date
CN102937918B (en) 2016-03-30

Similar Documents

Publication Publication Date Title
CN102937918B (en) Data block balancing method in operation process of HDFS
US11656911B2 (en) Systems, methods, and apparatuses for implementing a scheduler with preemptive termination of existing workloads to free resources for high priority items
Kaur et al. Container-as-a-service at the edge: Trade-off between energy efficiency and service availability at fog nano data centers
US10514951B2 (en) Systems, methods, and apparatuses for implementing a stateless, deterministic scheduler and work discovery system with interruption recovery
CN104331321B (en) Cloud computing task scheduling method based on tabu search and load balancing
CN101986274B (en) Resource allocation system and resource allocation method in private cloud environment
US20180321971A1 (en) Systems, methods, and apparatuses for implementing a scalable scheduler with heterogeneous resource allocation of large competing workloads types using qos
CN100428131C (en) Method for distributing resource in large scale storage system
CN103377091A (en) Method and system for efficient execution of jobs in a shared pool of resources
CN103970607A (en) Computing Optimized Virtual Machine Allocations Using Equivalence Combinations
CN110321223A (en) The data flow division methods and device of Coflow work compound stream scheduling perception
CN104050042A (en) Resource allocation method and resource allocation device for ETL (Extraction-Transformation-Loading) jobs
CN103297499A (en) Scheduling method and system based on cloud platform
CN109947532B (en) Big data task scheduling method in education cloud platform
CN107864211B (en) Cluster resource dispatching method and system
CN107038070A (en) The Parallel Task Scheduling method that reliability is perceived is performed under a kind of cloud environment
CN107678752A (en) A kind of task processing method and device towards isomeric group
CN103095788A (en) Cloud resource scheduling policy based on network topology
CN109408590A (en) Expansion method, device, equipment and the storage medium of distributed data base
Gandomi et al. HybSMRP: a hybrid scheduling algorithm in Hadoop MapReduce framework
CN105373432A (en) Cloud computing resource scheduling method based on virtual resource state prediction
Lu et al. A genetic algorithm-based job scheduling model for big data analytics
CN105005503B (en) Cloud computing load balancing method for scheduling task based on cellular automata
CN104239520B (en) A kind of HDFS data block Placement Strategies based on historical information
Ghazali et al. A classification of Hadoop job schedulers based on performance optimization approaches

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160330

Termination date: 20201016

CF01 Termination of patent right due to non-payment of annual fee