CN102937918B

CN102937918B - An HDFS runtime data block balancing method

Info

Publication number: CN102937918B
Application number: CN201210393176.9A
Authority: CN
Inventors: 曹海军; 伍卫国; 董小社; 樊源泉; 魏伟; 朱霍
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2012-10-16
Filing date: 2012-10-16
Publication date: 2016-03-30
Anticipated expiration: 2032-10-16
Also published as: CN102937918A

Abstract

The invention discloses an HDFS runtime data block balancing method. The method first preprocesses the local task list of each node, dividing it into complete local tasks and incomplete local tasks, which provides the basis for deciding whether to start HDFS data block balancing. The processing rate of each node is then evaluated and its task requests are predicted. After these steps, the task allocation process of each node is designed and implemented. Suitable nodes are then selected and data blocks are moved between them, so that the distribution of data blocks matches the predicted node task request sequence and the data blocks finally become balanced. The invention proposes an HDFS balancing strategy based on moving data blocks at runtime: non-local map task executions that may occur are identified in advance by predicting node task requests, and suitable data blocks are moved between the corresponding nodes, so that a node is allocated a local map task when it sends its actual task request, which improves the efficiency of the Map stage.

Description

An HDFS runtime data block balancing method

Technical field

The invention belongs to the field of computer technology and relates to a data block balancing method, in particular to a method for balancing the data blocks of HDFS (Hadoop Distributed File System) at runtime in a cloud computing environment.

Background technology

Hadoop is a highly reliable and highly scalable storage and distributed parallel computing platform developed by the open-source organization Apache. It was first developed as the underlying platform of the open-source search engine project Nutch, later became independent of Nutch, and is now one of the typical open-source cloud computing platforms. The core of Hadoop consists of a block-based distributed file system (Hadoop Distributed File System, HDFS) and the MapReduce component for distributed computation. HDFS provides a Hadoop cluster with a storage system made up of many nodes. When a large data file is stored, it is split into multiple data blocks of the same size (except for the last block), which are distributed over the nodes of the cluster. To ensure reliability, HDFS creates multiple replicas of each data block according to its configuration and places them on different nodes of the cluster. HDFS provides data storage services for the MapReduce computing engine above it. Hadoop MapReduce divides an application into many small tasks that run in parallel, and each small task processes a data block stored locally on its computing node.

HDFS stores data sets in a distributed manner using a block mechanism and improves system reliability through a block redundancy strategy: each data block has multiple replicas in the system at the same time, and these replicas are placed on multiple nodes in multiple racks, which prevents the loss of a data block when a single node fails. In addition, this distributed redundancy supports concurrent reads of a file, which makes HDFS well suited to the "write once, read many times" data processing model. To implement this redundancy strategy, HDFS must ensure that multiple replicas are written together when data is written.

When writing a data stream, HDFS first obtains multiple nodes from the NameNode and forms them into a pipeline. When the data stream reaches the first node in the pipeline, that node stores the data and forwards it to the second node in the pipeline. Likewise, the second node stores the data and forwards it to the third node, and so on, until all replicas have been written.

HDFS takes the following points into account when placing data blocks and their replicas:

1) when the node submitting the data is itself a node that stores data blocks in HDFS, one replica of the data block is placed on that node;

2) the replicas of a data block must be spread over multiple racks, so that the failure of a single rack does not make the data entirely unavailable;

3) another node in the same rack as the submitting node must also hold a replica of the data block, which reduces inter-rack communication and I/O overhead as far as possible;

4) provided the conditions above are met, the utilization of node storage space is also taken into account, so that storage utilization is as balanced as possible across nodes.

The Hadoop Map stage is the first stage of a MapReduce job. It mainly converts external input data into intermediate data in <Key, Value> form, which is supplied to the subsequent Reduce stage as input. In a distributed parallel processing environment, the Hadoop Map stage uses the distributed file system HDFS as its input data source and, under the guiding principle that "moving computation is cheaper than moving data", assigns the Map processing specified by the user at job submission to the nodes that store the HDFS data blocks. When the input data required by the processing assigned to a node is stored on that same node, the processing is said to satisfy data locality.

Hadoop MapReduce avoids processing multiple replicas of the same data block more than once through its node task request and allocation mechanism. However, an analysis of how the Hadoop Map stage executes shows that the locality of a Map task's input data also has a large effect on the execution speed of the task. When the Map input data and the Map task are on the same node, the network transfer of the data block is avoided and the Map task runs faster. In the existing Hadoop architecture, the distribution of HDFS block replicas directly affects the locality of Map task input data through the Hadoop task scheduler.

Therefore, although the existing HDFS block placement strategy keeps the number of data blocks roughly balanced across nodes, the unreasonable distribution of some block replicas causes certain nodes to "steal" the local Map tasks of other nodes; those other nodes, whose local Map tasks have already been allocated elsewhere, must in turn "steal" tasks as well. This "task allocation skew" further increases the amount of non-local data transferred in the Map stage, puts heavy transmission pressure on the whole network, and affects the efficiency of the entire stage. Moreover, even when the number of data blocks is balanced across nodes, differences in node processing speed can also lead to a large amount of non-local task processing.

Summary of the invention

The object of the invention is to solve the problem of low data locality of map tasks in the Map stage caused by the uneven distribution of HDFS data blocks, by providing an HDFS runtime data block balancing method. The method proposes an HDFS balancing strategy based on moving data blocks at runtime: non-local map task executions that may occur are identified in advance by predicting node task requests, and suitable data blocks are moved between the corresponding nodes, so that a node is allocated a local map task when it sends its actual task request, which improves the efficiency of the Map stage.

The object of the invention is achieved by the following technical solution:

The HDFS runtime data block balancing method comprises the following steps:

1) Preprocessing of the local task lists of the nodes

1.1 Complete local tasks and incomplete local tasks are defined: because each HDFS data block has multiple replicas, the same task appears in the local Map task lists of different nodes; therefore, n map tasks remaining in the local task list of a node does not mean that the number of local tasks the node can be allocated and execute is n;

1.2 Preprocessing of the local task list of a node: as the nodes send task requests in turn, the currently executable task is taken from the node's local task list and added to the node's complete local task list, while the tasks in the local task list that are not allocated are added to the incomplete local task list;

2) Collection of node runtime statistics

This is implemented by a NodeEvaluateInfo class, which records the total number of data blocks the node has processed (sum), the total time the node has spent processing them (cost), and the progress of the currently running task (tip); from this information, the node's average block processing time, cost/sum, and the remaining time of the node's currently running task, (1-tip)×(cost/sum), can be computed;

3) Node rate evaluation and task request sequence prediction

3.1 Node rate evaluation: using the statistics from step 2), $COST_i / NUM_i$ represents the data processing rate of each node, i.e. the average time the node takes to process a single task; here $NUM_i$ is the number of local map tasks node $i$ has completed at a given moment, and $COST_i$ is the total time spent processing those local tasks;

3.2 Prediction of the system task request sequence: the system task request sequence is the sequence of moments, from the current time until the job completes, at which each slave node applies to the master node for a task to execute. Suppose that at time $T_0$ the progress of the task being processed by node $i$ is $P_i$, and the average time for the node to process a single data block, obtained from the rate evaluation formula above, is $T_i$; then the time $t_{ik}$ of the $k$-th task request of this node is

$$t_{ik} = T_0 + (1 - P_i) \times T_i + (k - 1) \times T_i, \quad k \ge 1;$$

where $k$ denotes the $k$-th task request of this node counted from the current time. After the task request sequence of each node has been obtained, the system task request sequence is determined as follows: let the number of remaining tasks in the system be $m$ and the number of nodes be $n$; for each node $i$, take the time points of its first $m$ task requests counted from the current time, denoted $\{t_{i1}, t_{i2}, \ldots, t_{im}\}$, so that the $n$ nodes yield $n \times m$ time points $\{t_{11}, t_{12}, \ldots, t_{1m}, t_{21}, t_{22}, \ldots, t_{2m}, \ldots, t_{n1}, t_{n2}, \ldots, t_{nm}\}$. Sorting all time points in ascending order and taking the first $m$ gives the request sequence $R_m$ for the $m$ remaining tasks in the system, counted from the current time. $R_m(j) = t_{ik}$ means that the $j$-th task request in the system will be sent by node $i$ at time $t_{ik}$, and that this request is the $k$-th request of node $i$;

4) Analysis and implementation of node task allocation: the task allocation of each node is determined in advance under the node request sequence predicted in step 3);

5) Selection of node pairs for data block movement: the node that sends each request is obtained from the task request sequence, and a task is then taken from the local task list of that node; if the task is empty, the node is identified as a to-be-balanced node and is added to the list of to-be-balanced nodes. The first step of selecting node pairs for data block movement is to traverse the allocate array and build a mapping table Map<node, List<Task>> that records all unallocated tasks on all data block source nodes;

6) Movement of data blocks between nodes

Once the to-be-balanced nodes and the data block source nodes have been determined, the actual data block movement can be carried out. Because data block movement is independent of node task execution, and multiple data blocks may need to be moved, Java thread pool technology is used to implement the whole data block movement in order to improve efficiency and simplify programming.

Further, in step 4) above, the response process of the Hadoop scheduler under the currently predicted system task request sequence $R_m$ is simulated. According to the request time of each node and the current task allocation in the system, the task allocated in response to each request is determined, and whether that allocation satisfies task locality is judged. The allocation record of a task is implemented by the AllocatedRecord class, which records the task's allocation flag, the number of the node it is allocated to, the time of allocation, and whether the data block corresponding to the task has been added to the to-be-exchanged list. The node task request record stores the node that sends the request and the position of the request among all requests of that node counted from the current time onward. Finally, according to the system task request sequence $R_m$ determined in step 3), in which the $j$-th request $R_m(j) = t_{ik}$ is sent by node $i$ at time $t_{ik}$ and is the $k$-th request of node $i$, the task request sequence $R_m$ is traversed; for the $j$-th task request occurring in the system, the search for the first schedulable local Map task starts from the $k$-th task $task_{(i,k)}$ in the local Map task list of node $i$. A local Map task is judged schedulable if:

(1) $task_{(i,k)}$ is not empty;

(2) allocate[$task_{(i,k)}$.id] == -1, i.e. the scheduler has not allocated this task to another node;

When $task_{(i,k+m \mid m \ge 0)}$ satisfies task locality, the allocation record of the corresponding task is set, allocate[$task_{(i,k)}$.id] = i, and the analysis of the $j$-th task request ends. Otherwise, when $task_{(i,k+m \mid m \ge 0)}$ is not empty, it is added to the exchangeable task queue of this node and the next local Map task $task_{(i,k+m+1 \mid m \ge 0)}$ is examined. When $task_{(i,k+m \mid m \ge 0)}$ is empty, a to-be-balanced node object BalanceNode is built from node $i$ and its exchangeable task queue and is recorded in the list of to-be-balanced nodes.

Further, the detailed process of step 5) above is as follows: for each unallocated task, the set of nodes storing replicas of its data block is obtained, and the nodes in this set and the task are put into the mapping table in the form <Node, List<Task>>, with the task appended to the tail of the List<Task> when the node is already present. After all data block source nodes have been obtained, for each to-be-balanced node in the list of to-be-balanced nodes, the mapping table is traversed to find the first data block source node located in the same rack as the to-be-balanced node, and a data block move request is constructed and submitted. Two nodes are judged to be in the same rack if their node name prefixes are identical. When no data block source node in the same rack as the to-be-balanced node can be found, the first data block source node in the mapping table is selected.

The beneficial effects of the invention are as follows:

The invention addresses the differences in how fast different nodes process data blocks while a Hadoop job runs its Map stage. By moving data blocks, it makes the block distribution better match the performance of each node. This not only reduces non-local Map task allocations for subsequent similar jobs run on the same data set, improves Map task locality, and promotes balanced execution of Map-stage tasks across nodes, but also improves task balance for the currently running job during the remainder of its Map stage. In the Hadoop Map stage, the Map tasks execute completely independently of one another. When a Map task is a local Map task, it only needs to read data from the local disk and requires almost no network communication other than reporting its progress to the JobTracker node. Therefore, while nodes are processing local Map tasks, moving data blocks puts little pressure on the whole network.

Brief description of the drawings

Fig. 1 is a class diagram of the node task allocation analysis process;

Fig. 2 is a flow chart of the node task allocation analysis;

Fig. 3 is a diagram of matching to-be-balanced nodes with tasks to be allocated;

Fig. 4 is a flow chart of matching node pairs for data block movement;

Fig. 5 is the framework of the data block movement thread pool;

Fig. 6 is a class diagram of the data block movement thread pool;

Fig. 7 shows data block movement between nodes.

Embodiment

The invention is described in detail below with reference to the accompanying drawings.

The HDFS data block balancing strategy based on moving data blocks at runtime is implemented through the following steps:

Step 1: preprocessing of the local task lists of the nodes. The local task list of each node is preprocessed and divided into a complete local task part and an incomplete local task part. Together, the complete local task parts of all nodes cover the input data set exactly once, and no task appears in more than one of them. Ideally, if every node is allocated its complete local tasks, the distribution of HDFS data blocks matches the scheduler's allocation to each node, i.e. the HDFS block placement is balanced. Conflicting task allocations can then be identified by predicting the future task requests of the nodes, and the non-local task allocations that may occur can be judged from them, which provides a reliable basis for balancing the HDFS block placement. The preprocessing of the local task lists simulates the process by which the JobTracker schedules and allocates tasks. Assuming that every node processes tasks at the same speed, the nodes send task requests in turn; the preprocessing responds to each request by taking the currently executable task from the node's local task list and adding it to the node's complete local task list, while the tasks in the local task list that are not allocated are added to the incomplete local task list. An executable local task is one that has not yet been executed on another node storing a replica.
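The preprocessing described above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name `LocalTaskPreprocessor`, the use of integer task ids, and the round-robin reading of "nodes send task requests in turn" are not taken from the patent.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: node i owns localTasks.get(i); task ids are plain integers. */
final class LocalTaskPreprocessor {

    /** Simulates equal-speed nodes requesting tasks in turn; returns each node's complete local task list. */
    static List<List<Integer>> completeLists(List<List<Integer>> localTasks) {
        int n = localTasks.size();
        Set<Integer> taken = new HashSet<>();   // tasks already handed to some node
        int[] cursor = new int[n];              // next position to inspect in each local list
        List<List<Integer>> complete = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            complete.add(new ArrayList<>());
        }
        boolean progressed = true;
        while (progressed) {                    // one pass = every node sends one request
            progressed = false;
            for (int i = 0; i < n; i++) {
                List<Integer> local = localTasks.get(i);
                // Skip tasks that another replica holder has already been given.
                while (cursor[i] < local.size() && taken.contains(local.get(cursor[i]))) {
                    cursor[i]++;
                }
                if (cursor[i] < local.size()) {
                    Integer task = local.get(cursor[i]++);
                    taken.add(task);
                    complete.get(i).add(task);
                    progressed = true;
                }
            }
        }
        return complete;
    }

    /** Tasks in a node's local list that it was not given form its incomplete local task list. */
    static List<List<Integer>> incompleteLists(List<List<Integer>> localTasks,
                                               List<List<Integer>> complete) {
        List<List<Integer>> incomplete = new ArrayList<>();
        for (int i = 0; i < localTasks.size(); i++) {
            List<Integer> rest = new ArrayList<>(localTasks.get(i));
            rest.removeAll(complete.get(i));
            incomplete.add(rest);
        }
        return incomplete;
    }
}
```

With local lists [0, 1, 2], [0, 1] and [2] for three nodes, each node ends up with one complete local task, and the complete lists do not overlap, which is the property the text requires of the complete local task parts.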

Step 2: collection of node runtime statistics. The invention uses a NodeEvaluateInfo class to represent a node in the system. The class records the total number of data blocks the node has processed (sum), the total time the node has spent processing them (cost), and the progress of the currently running task (tip). From this information, the node's average block processing time, cost/sum, and the remaining time of the node's currently running task, (1-tip)×(cost/sum), can be computed.
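A minimal sketch of such a statistics class is given here. The field names sum, cost and tip follow the text; the method names are hypothetical, and the remaining time is computed as (1 - tip) multiplied by the average block time, which is the form that agrees with the request-time formula of the third step.

```java
/** Hypothetical sketch of per-node runtime statistics; not the patent's actual class. */
final class NodeEvaluateInfo {
    private long sum;     // number of data blocks the node has finished processing
    private double cost;  // total time spent on those blocks
    private double tip;   // progress of the task currently running, in [0, 1]

    /** Records one finished block and the time it took. */
    void recordCompletedBlock(double elapsed) {
        sum++;
        cost += elapsed;
        tip = 0.0; // the next task has not made progress yet
    }

    /** Updates the progress of the running task. */
    void setProgress(double progress) {
        if (progress < 0.0 || progress > 1.0) {
            throw new IllegalArgumentException("progress must be in [0, 1]");
        }
        tip = progress;
    }

    /** Average processing time per block: cost / sum. */
    double averageBlockTime() {
        if (sum == 0) {
            throw new IllegalStateException("no completed blocks yet");
        }
        return cost / sum;
    }

    /** Estimated time left for the running task: (1 - tip) * (cost / sum). */
    double remainingTime() {
        return (1.0 - tip) * averageBlockTime();
    }
}
```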

Step 3: node rate evaluation and task request sequence prediction. 1) Node rate evaluation. The invention proposes a node rate evaluation scheme that uses the node statistics collected above and takes $COST_i / NUM_i$ to represent the data processing rate of each node, i.e. the average time the node takes to process a single task. Here $NUM_i$ is the number of local map tasks node $i$ has completed at a given moment, and $COST_i$ is the total time spent processing those local tasks. 2) System task request sequence prediction. The system task request sequence is the sequence of moments, from the current time until the job completes, at which each slave node applies to the master node for a task to execute. In theory, this sequence can only be known exactly after the job has completed, but it can be predicted while the job is running from the data processing rate of each node and the progress of the task each node is currently processing. Suppose that at time $T_0$ the progress of the task being processed by node $i$ is $P_i$, and the average time for the node to process a single data block, obtained from the rate evaluation formula above, is $T_i$; then the time $t_{ik}$ of the $k$-th task request of this node is

$$t_{ik} = T_0 + (1 - P_i) \times T_i + (k - 1) \times T_i, \quad k \ge 1$$

where $k$ denotes the $k$-th task request of this node counted from the current time.
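As a worked illustration of this formula with made-up values (not taken from the patent), suppose $T_0 = 100$ s, $P_i = 0.25$ and $T_i = 40$ s. The first three request times of node $i$ are then:

$$
\begin{aligned}
t_{i1} &= 100 + (1 - 0.25) \times 40 + 0 \times 40 = 130 \\
t_{i2} &= 100 + (1 - 0.25) \times 40 + 1 \times 40 = 170 \\
t_{i3} &= 100 + (1 - 0.25) \times 40 + 2 \times 40 = 210
\end{aligned}
$$

Successive requests of the same node are spaced exactly $T_i$ apart; only the first one depends on the progress $P_i$.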

After the task request sequence of each node has been obtained, the system task request sequence can be determined as follows.

Let the number of remaining tasks in the system be $m$ and the number of nodes be $n$. For each node $i$, take the time points of its first $m$ task requests counted from the current time, denoted $\{t_{i1}, t_{i2}, \ldots, t_{im}\}$; the $n$ nodes then yield $n \times m$ time points $\{t_{11}, t_{12}, \ldots, t_{1m}, t_{21}, t_{22}, \ldots, t_{2m}, \ldots, t_{n1}, t_{n2}, \ldots, t_{nm}\}$. Sorting all time points in ascending order and taking the first $m$ gives the request sequence $R_m$ for the $m$ remaining tasks in the system, counted from the current time. $R_m(j) = t_{ik}$ means that the $j$-th task request in the system will be sent by node $i$ at time $t_{ik}$, and that this request is the $k$-th request of node $i$.

This process can be described formally as follows: given integer sequences $A = \{a_1, a_2, \ldots, a_n\}$ and $B = \{b_1, b_2, \ldots, b_n\}$, construct the integer set $C = \{c_i \mid c_i = a_i + k \times b_i,\ k \ge 0\}$ and find the first $m$ elements of $C$ in ascending order.
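The selection of the first $m$ request times can be sketched in Java as below. Instead of materialising all $n \times m$ time points and sorting them as the text describes, this sketch merges the per-node sequences with a priority queue, which yields the same first $m$ elements because each node's request times are increasing. The class and method names are hypothetical, every $T_i$ is assumed to be positive, and ties are broken by node index, which the patent does not specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class RequestSequencePredictor {

    /** One predicted request: node i sends its k-th request (k >= 1) at the given time. */
    static final class Request {
        final int node;
        final int k;
        final double time;

        Request(int node, int k, double time) {
            this.node = node;
            this.k = k;
            this.time = time;
        }
    }

    /** t_ik = T0 + (1 - P_i) * T_i + (k - 1) * T_i. */
    static double requestTime(double t0, double progress, double avgTime, int k) {
        return t0 + (1.0 - progress) * avgTime + (k - 1) * avgTime;
    }

    /** Returns the first m predicted requests of the whole system in ascending time order (R_m). */
    static List<Request> predict(double t0, double[] progress, double[] avgTime, int m) {
        PriorityQueue<Request> heap = new PriorityQueue<>((a, b) -> {
            int byTime = Double.compare(a.time, b.time);
            return byTime != 0 ? byTime : Integer.compare(a.node, b.node);
        });
        for (int i = 0; i < progress.length; i++) {
            heap.add(new Request(i, 1, requestTime(t0, progress[i], avgTime[i], 1)));
        }
        List<Request> sequence = new ArrayList<>();
        while (sequence.size() < m && !heap.isEmpty()) {
            Request next = heap.poll();
            sequence.add(next);
            // Only the same node's following request can precede its later ones.
            int k = next.k + 1;
            heap.add(new Request(next.node, k,
                    requestTime(t0, progress[next.node], avgTime[next.node], k)));
        }
        return sequence;
    }
}
```

For example, with $T_0 = 0$, two nodes with progress 0.5 and 0.0, and average times 10 and 4, the first four requests come from nodes 1, 0, 1, 1 at times 4, 5, 8 and 12.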

Step 4: node task allocation analysis.

1) Design of the node task allocation process. The node task allocation analysis is in essence a simulation of how the Hadoop scheduler responds under the currently predicted system task request sequence $R_m$. According to the request time of each node and the current task allocation in the system, the task allocated in response to each request is determined, and whether that allocation satisfies task locality is judged. When a node's task request cannot be answered with a local Map task, the node is a to-be-balanced node. The classes involved in node task allocation are shown in Fig. 1. Node task request record NodeRequest: since the node task allocation analysis simulates the response of the Hadoop task scheduler under the currently predicted system task request sequence, NodeRequest describes one node task request in the simulation; it mainly records the node that sends the request and the position of the request among all requests of that node counted from the current time onward.

2) Implementation of the node task allocation process. The node task allocation analysis determines in advance the task allocation of each node under the predicted node request sequence; its flow is shown in Fig. 2. As described above, the $j$-th request $R_m(j) = t_{ik}$ in the system task request sequence $R_m$ is sent by node $i$ at time $t_{ik}$ and is the $k$-th request of node $i$. The task request sequence $R_m$ is traversed; for the $j$-th task request occurring in the system, the search for the first schedulable local Map task starts from the $k$-th task $task_{(i,k)}$ in the local Map task list of node $i$. A local Map task is judged schedulable if:

(1) $task_{(i,k)}$ is not empty;

(2) allocate[$task_{(i,k)}$.id] == -1, i.e. the scheduler has not allocated this task to another node. When $task_{(i,k+m \mid m \ge 0)}$ satisfies task locality, the allocation record of the corresponding task is set, allocate[$task_{(i,k)}$.id] = i, and the analysis of the $j$-th task request ends. Otherwise, when $task_{(i,k+m \mid m \ge 0)}$ is not empty, it is added to the exchangeable task queue of this node and the next local Map task $task_{(i,k+m+1 \mid m \ge 0)}$ is examined. When $task_{(i,k+m \mid m \ge 0)}$ is empty, a to-be-balanced node object BalanceNode is built from node $i$ and its exchangeable task queue and is recorded in the list of to-be-balanced nodes. After all task requests have been analysed, if the HDFS block placement is unbalanced, the list of to-be-balanced nodes is not empty and some allocation records in the allocate array are still -1. At this point the number of to-be-balanced nodes equals the number of unallocated entries in the allocate array, and no unallocated task is a local Map task of any to-be-balanced node; otherwise some node would have been allocated that task during the node task allocation process above.
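The allocation analysis above can be sketched as a small simulation. The names `AllocationAnalyzer`, `analyze` and `exchangeableTasks` are hypothetical; tasks are represented by integer ids that index the allocate array, requests are given as {node, k} pairs, and the allocation is recorded for the task actually found by the search, which is an interpretation of the text rather than a statement of the patent's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AllocationAnalyzer {

    /** A node whose request could not be answered locally, plus the tasks it skipped. */
    static final class BalanceNode {
        final int node;
        final List<Integer> exchangeableTasks;

        BalanceNode(int node, List<Integer> exchangeableTasks) {
            this.node = node;
            this.exchangeableTasks = exchangeableTasks;
        }
    }

    /**
     * @param requests   predicted sequence R_m; requests[j] = {i, k}: node i sends its k-th request (k >= 1)
     * @param localTasks localTasks.get(i) = local Map task ids of node i, in list order
     * @param allocate   output array indexed by task id; -1 means "not allocated"
     * @return the to-be-balanced nodes, in the order their unanswered requests occur
     */
    static List<BalanceNode> analyze(int[][] requests, List<List<Integer>> localTasks, int[] allocate) {
        Arrays.fill(allocate, -1);
        List<BalanceNode> toBalance = new ArrayList<>();
        for (int[] request : requests) {
            int node = request[0];
            List<Integer> local = localTasks.get(node);
            List<Integer> exchangeable = new ArrayList<>();
            boolean served = false;
            // Start at the k-th local task and look for the first one not yet allocated.
            for (int pos = request[1] - 1; pos < local.size(); pos++) {
                int task = local.get(pos);
                if (allocate[task] == -1) {
                    allocate[task] = node;
                    served = true;
                    break;
                }
                exchangeable.add(task); // already allocated: candidate for exchange
            }
            if (!served) {
                toBalance.add(new BalanceNode(node, exchangeable));
            }
        }
        return toBalance;
    }
}
```

In a run with four tasks where nodes 0 and 1 both hold tasks 0 and 1 locally and node 2 holds tasks 2 and 3, the request sequence (0,1), (1,1), (0,2), (1,2) leaves two entries of allocate at -1 and produces two to-be-balanced nodes, matching the count property stated in the text.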

Step 5: selection of node pairs for data block movement. After the node task allocation analysis is complete, data blocks can be moved between the to-be-balanced nodes and the nodes storing the input blocks of the unallocated tasks. The process of determining the data block source nodes and moving the data blocks is shown in Fig. 3. The matching of to-be-balanced nodes with unallocated tasks is subject to the following constraints: (1) the exchange node is selected from the nodes storing the replicas of the unallocated task's data block, and to reduce communication overhead, a replica storage node in the same rack as the to-be-balanced node is preferred; (2) data block transfers between multiple to-be-balanced nodes and the same data block storage node should be avoided as far as possible.

To speed up the matching between nodes, the invention adopts a greedy algorithm. The data block storage nodes of all unallocated tasks are first resolved, and possible matches are then searched for between the set of to-be-balanced nodes and the set of storage nodes. As soon as a to-be-balanced node and a storage node of an unallocated task's data block are found to satisfy the constraints above, the match between the two is fixed, and no further search is made for other, possibly better, matches. The time complexity of this process is O(N), where N is the number of to-be-balanced nodes. The detailed node matching process is shown in Fig. 4.

The first step of selecting node pairs for data block movement is to traverse the allocate array and build a mapping table Map<node, List<Task>> that records all unallocated tasks on all data block source nodes. In detail, for each unallocated task, the set movableNodes of nodes storing replicas of its data block is obtained, and the nodes in movableNodes and the task are put into the mapping table nodeToTasks in the form <Node, List<Task>>, with the task appended to the tail of the List<Task> when the node is already present.
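Building the mapping table could look like the following sketch. The name nodeToTasks comes from the text; the class `NodeToTasksBuilder`, the use of node names as String keys, integer task ids, and the `replicaNodes` parameter (the replica storage nodes of each task's data block) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;

final class NodeToTasksBuilder {

    /**
     * @param allocate     allocation record per task id; -1 marks an unallocated task
     * @param replicaNodes replicaNodes.get(t) = names of the nodes storing replicas of task t's block
     * @return map from data block source node to the unallocated tasks whose blocks it stores
     */
    static LinkedHashMap<String, List<Integer>> build(int[] allocate, List<List<String>> replicaNodes) {
        // accessOrder = true: entries are iterated from least to most recently accessed.
        LinkedHashMap<String, List<Integer>> nodeToTasks = new LinkedHashMap<>(16, 0.75f, true);
        for (int task = 0; task < allocate.length; task++) {
            if (allocate[task] != -1) {
                continue; // already allocated to some node
            }
            for (String node : replicaNodes.get(task)) {
                // Create the list on first sight of the node, then append at the tail.
                nodeToTasks.computeIfAbsent(node, key -> new ArrayList<>()).add(task);
            }
        }
        return nodeToTasks;
    }
}
```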

In the implementation, java.util.LinkedHashMap is chosen as the type of nodeToTasks. The feature of this class used here is that it iterates over the key-value pairs of the mapping table in access order: after a key-value pair is accessed, it is moved to the tail of the linked list. Using LinkedHashMap therefore effectively avoids data block movement between multiple to-be-balanced nodes and the same data source node.
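Note that java.util.LinkedHashMap iterates in insertion order by default; the access-order behaviour relied on here is enabled through the three-argument constructor with accessOrder set to true. A small demonstration, with a hypothetical class name:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AccessOrderDemo {

    /** Inserts a, b, c, then accesses a, and returns the resulting iteration order of the keys. */
    static List<String> keysAfterAccess() {
        // Third constructor argument true selects access order instead of insertion order.
        Map<String, Integer> map = new LinkedHashMap<>(16, 0.75f, true);
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        map.get("a"); // moves the entry for "a" to the tail
        return new ArrayList<>(map.keySet());
    }
}
```

Here `keysAfterAccess()` returns the keys in the order b, c, a: the source node that was just used moves to the back, so the next to-be-balanced node meets other source nodes first.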

After all data block source nodes have been obtained, for each to-be-balanced node in the list of to-be-balanced nodes, nodeToTasks is traversed to find the first data block source node located in the same rack as the to-be-balanced node, and a data block move request is constructed and submitted. Two nodes are judged to be in the same rack if their node name prefixes are identical; for example, node/rack-A/node01 and node/rack-A/node02 are in the same rack. When no data block source node in the same rack as the to-be-balanced node can be found, the first data block source node in nodeToTasks is selected.
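The source node selection can be sketched as follows. The class `SourceNodeSelector` and its methods are hypothetical, and the rack prefix is assumed to be everything before the last '/' in the node name. In an access-ordered LinkedHashMap, calling get during iteration is a structural modification, so the sketch touches the chosen entry only after the loop has ended.

```java
import java.util.LinkedHashMap;
import java.util.List;

final class SourceNodeSelector {

    /** Rack prefix of a node name, e.g. "node/rack-A" for "node/rack-A/node01". */
    static String rackOf(String nodeName) {
        int idx = nodeName.lastIndexOf('/');
        return idx < 0 ? "" : nodeName.substring(0, idx);
    }

    /**
     * Picks the first source node in the same rack as the to-be-balanced node,
     * falling back to the first source node in the table; returns null if the table is empty.
     */
    static String select(String balanceNode, LinkedHashMap<String, List<Integer>> nodeToTasks) {
        String first = null;
        String chosen = null;
        // Iterating a collection view does not count as an access to the map.
        for (String candidate : nodeToTasks.keySet()) {
            if (first == null) {
                first = candidate;
            }
            if (rackOf(candidate).equals(rackOf(balanceNode))) {
                chosen = candidate;
                break;
            }
        }
        if (chosen == null) {
            chosen = first;
        }
        if (chosen != null) {
            nodeToTasks.get(chosen); // in access-order mode this moves the entry to the tail
        }
        return chosen;
    }
}
```

When nodeToTasks is created with accessOrder set to true, two to-be-balanced nodes in the same rack are given different source nodes as long as the rack holds more than one candidate.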

Step 6: data block movement between nodes. Once the to-be-balanced nodes and the data block source nodes have been determined, the actual data block movement can be carried out. Because data block movement is independent of node task execution, and multiple data blocks may need to be moved, Java thread pool technology is used to implement the data block movement module in order to improve efficiency and simplify programming, as shown in Fig. 5. Each data block movement task waits to be executed by an idle thread in the thread pool; after the task finishes, the thread returns to the pool and accepts the next task. 1) Thread pool design. The classes involved in the thread pool are shown in Fig. 6. 2) Data block movement task. Each task corresponds to one data block; according to the task allocation record of the to-be-balanced node, the corresponding data block can be resolved from the task number, and the node with the largest index in the local Map task list is selected as the destination node. Each data block move request is packaged into a MoveTask class, which contains the data block Block to be moved, the data block source node, and the destination node BalanceNode. The data block movement task Transfer implements the java.lang.Runnable interface and implements the block movement logic in its run() method. Its flow is shown in Fig. 7.

For each data block move request, Transfer parses out the data block object, the data block source node, and the destination node, and sends the block replacement instruction OP_REPLACE_BLOCK to the destination node; the destination node then sends an OP_COPY_BLOCK instruction to the source node to complete the transfer of the data block. Once the data block has been copied successfully, the destination node notifies the NameNode to delete the replica of this data block on the source node.
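The thread pool arrangement can be sketched with the standard java.util.concurrent executors. MoveTask and Transfer are named in the text, but their fields here are simplified to a block id and two node names, and the `BlockMover` interface is a hypothetical stand-in for the code that actually sends OP_REPLACE_BLOCK and OP_COPY_BLOCK; none of this is the patent's or Hadoop's actual API.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class BlockMoveSketch {

    /** One move request: which block goes from which source node to which destination node. */
    static final class MoveTask {
        final long blockId;
        final String source;
        final String target;

        MoveTask(long blockId, String source, String target) {
            this.blockId = blockId;
            this.source = source;
            this.target = target;
        }
    }

    /** Hypothetical hook for the real transfer (sending the replace/copy instructions). */
    interface BlockMover {
        void replaceBlock(long blockId, String source, String target) throws Exception;
    }

    /** Runnable wrapper so each move can be executed by a pool thread. */
    static final class Transfer implements Runnable {
        private final MoveTask task;
        private final BlockMover mover;

        Transfer(MoveTask task, BlockMover mover) {
            this.task = task;
            this.mover = mover;
        }

        @Override
        public void run() {
            try {
                mover.replaceBlock(task.blockId, task.source, task.target);
            } catch (Exception e) {
                System.err.println("moving block " + task.blockId + " failed: " + e);
            }
        }
    }

    /** Submits every move to a fixed-size pool and waits for all of them to finish. */
    static void moveAll(List<MoveTask> tasks, BlockMover mover, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (MoveTask task : tasks) {
            pool.execute(new Transfer(task, mover));
        }
        pool.shutdown(); // no new tasks; queued moves still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }
}
```

A fixed-size pool bounds the number of concurrent transfers, which keeps block movement from competing too heavily with running Map tasks; the one-hour wait is an arbitrary illustrative limit.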

Claims

1. An HDFS runtime data block balancing method, characterized in that it comprises the following steps:

1) preprocessing of the local task lists of the nodes

2) collection of node runtime statistics

3) node rate evaluation and task request sequence prediction

3.2 Prediction of the system task request sequence: the system task request sequence is the sequence of moments, from the current time until the job completes, at which each slave node applies to the master node for a task to execute; suppose that at time $T_0$ the progress of the task being processed by node $i$ is $P_i$, and the average time for the node to process a single data block, obtained from the rate evaluation formula above, is $T_i$; then the time $t_{ik}$ of the $k$-th task request of this node is $T_0 + (1 - P_i) \times T_i + (k - 1) \times T_i$, $k \ge 1$;

where $k$ denotes the $k$-th task request of this node counted from the current time; after the task request sequence of each node has been obtained, the system task request sequence is determined as follows: let the number of remaining tasks in the system be $m$ and the number of nodes be $n_1$; for each node $i$, take the time points of its first $m$ task requests counted from the current time, denoted $\{t_{i1}, t_{i2}, \ldots, t_{im}\}$, so that the $n_1$ nodes yield $n_1 \times m$ time points $\{t_{11}, t_{12}, \ldots, t_{1m}, t_{21}, t_{22}, \ldots, t_{2m}, \ldots, t_{n_1 1}, t_{n_1 2}, \ldots, t_{n_1 m}\}$; sorting all time points in ascending order and taking the first $m$ gives the request sequence $R_m$ for the $m$ remaining tasks in the system, counted from the current time, where $R_m(j) = t_{ik}$ means that the $j$-th task request in the system will be sent by node $i$ at time $t_{ik}$, and that this request is the $k$-th request of node $i$;

4) analysis and implementation of node task allocation: the task allocation of each node is determined in advance under the node request sequence predicted in step 3);

6) movement of data blocks between nodes

2. The HDFS runtime data block balancing method according to claim 1, characterized in that, in step 4), the response process of the Hadoop scheduler under the currently predicted system task request sequence $R_m$ is simulated; according to the request time of each node and the current task allocation in the system, the task allocated in response to each request is determined, and whether that allocation satisfies task locality is judged; the allocation record of a task is implemented by the AllocatedRecord class, which records the task's allocation flag, the number of the node it is allocated to, the time of allocation, and whether the data block corresponding to the task has been added to the to-be-exchanged list; the node task request record stores the node that sends the request and the position of the request among all requests of that node counted from the current time onward; finally, according to the system task request sequence $R_m$ determined in step 3), in which the $j$-th request $R_m(j) = t_{ik}$ is sent by node $i$ at time $t_{ik}$ and is the $k$-th request of node $i$, the task request sequence $R_m$ is traversed, and for the $j$-th task request occurring in the system, the search for the first schedulable local Map task starts from the $k$-th task $task_{(i,k)}$ in the local Map task list of node $i$; a local Map task is judged schedulable if:

a) $task_{(i,k)}$ is not empty;

b) allocate[$task_{(i,k)}$.id] == -1, i.e. the scheduler has not allocated this task to another node;

when $task_{(i,k+m_1 \mid m_1 \ge 0)}$ satisfies task locality, the allocation record of the corresponding task is set, allocate[$task_{(i,k)}$.id] = i, and the analysis of the $j$-th task request ends; otherwise, when $task_{(i,k+m_1 \mid m_1 \ge 0)}$ is not empty, it is added to the exchangeable task queue of this node and the next local Map task $task_{(i,k+m_1+1 \mid m_1 \ge 0)}$ is examined; when $task_{(i,k+m_1 \mid m_1 \ge 0)}$ is empty, a to-be-balanced node object BalanceNode is built from node $i$ and its exchangeable task queue and is recorded in the list of to-be-balanced nodes.

3. The HDFS runtime data block balancing method according to claim 1 or 2, characterized in that the detailed process of step 5) is as follows: for each unallocated task, the set of nodes storing replicas of its data block is obtained, and the nodes in this set and the task are put into the mapping table in the form <Node, List<Task>>, with the task appended to the tail of the List<Task> when the Node is already present; after all data block source nodes have been obtained, for each to-be-balanced node in the list of to-be-balanced nodes, the mapping table is traversed to find the first data block source node located in the same rack as the to-be-balanced node, and a data block move request is constructed and submitted; two nodes are judged to be in the same rack if their node name prefixes are identical; when no data block source node in the same rack as the to-be-balanced node can be found, the first data block source node in the mapping table is selected.