CN110262896A - A data processing acceleration method for the Spark system - Google Patents
- Publication number: CN110262896A (application CN201910467553.0A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/5027 — Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
- G06F9/5033 — considering data affinity
- G06F9/505 — considering the load
- G06F2209/5015 — Service provider selection
- G06F2209/504 — Resource capping
Abstract
The present invention discloses a data processing acceleration method for the Spark system, composed of three parts: a performance prediction module, a task scheduling module, and a data distribution module. The performance prediction module models the performance of a task from given parameters and predicts its completion time. The task scheduling module assigns computing tasks to servers: it obtains the currently available computing resources in real time through a resource monitor, then dispatches each task to a designated server through a purpose-built program interface. The invention aims to accelerate data processing in the Spark system by jointly considering hardware heterogeneity, computational interference, data locality, data skew, data spilling, and similar factors, optimizing data distribution and task scheduling so as to minimize the overall completion time.
Description
Technical field
The present invention relates to the field of big data technology, in particular task scheduling and data distribution in distributed computing, and more specifically to a data processing acceleration method for the Spark system.
Background
With the advent of the MapReduce computing model, processing and analyzing big data has become simple and efficient, but the straggler problem has always been a thorny issue in distributed computing. Stragglers are tasks among those running in parallel that take abnormally long to complete and therefore significantly degrade overall performance. The causes of stragglers come mainly from two levels, the hardware computing-resource level and the application level, and include hardware heterogeneity, computational interference, data locality, data spilling, and data skew. Stragglers not only delay the overall completion time; they also make tasks inefficient, waste hardware resources, and interfere with the normal execution of other tasks. Moreover, the prolonged running of a straggler increases the risk of task error and may even cause the entire job to fail.
At present, many methods have been proposed to mitigate the straggler problem and accelerate data processing, such as LATE, Dolly, delay scheduling, DREAMS, and LIBRA, but each has shortcomings, and none accelerates processing by considering all of the influencing factors together. LATE optimizes the default speculative-execution mechanism so that it can adapt to heterogeneous computing environments, and Dolly avoids the waiting and speculation of speculative execution by fully cloning small tasks. However, LATE and Dolly, which mitigate stragglers through backup tasks, must wait a short time to collect statistics on task performance before generating a strategy, and tasks that still run very slowly after being backed up may waste computing resources. Delay scheduling improves overall data locality by making tasks that cannot satisfy data locality wait, and DREAMS allocates different amounts of computing resources to tasks based on the dynamic size of each task's computation; although both methods consider the influence of application-level data, they still cannot solve the straggler problem caused by data skew. LIBRA specially optimizes for data skew by supporting the splitting of key-value data, but it does not account for the influence of computational interference and task scheduling.
Most existing methods do not apply to Spark, currently the most popular distributed computing framework. To make up for their deficiencies, the present invention proposes a data processing acceleration method for the Spark system. Compared with existing work, the present invention considers both the hardware computing-resource level and the application level: based on the degree to which factors such as hardware heterogeneity, computational interference, data locality, and data skew affect task performance, it establishes a task-performance prediction model, proposes a model-solving algorithm that completes within seconds, and accordingly formulates data distribution and task scheduling strategies that minimize the overall running time and accelerate tasks.
Summary of the invention
The present invention aims to accelerate data processing in the Spark system by jointly considering hardware heterogeneity, computational interference, data locality, data skew, data spilling, and similar factors, optimizing data distribution and task scheduling so as to minimize the overall completion time.
To solve the problems of the prior art, the present invention adopts the following technical scheme:
A data processing acceleration method for the Spark system, composed of three parts: a performance prediction module, a task scheduling module, and a data distribution module, wherein:
The performance prediction module models the performance of a task from given parameters and predicts its completion time.
The task scheduling module assigns computing tasks to servers: it formulates the scheduling strategy for a given task from the hardware-heterogeneity and data-locality factors, obtains the currently available computing resources in real time through a resource monitor, and then dispatches the task to a designated server through a purpose-built program interface.
The data distribution module carries out the process of distributing the intermediate data in (key, value) key-value format generated by tasks to different Reduce tasks according to certain rules, wherein:
1) The data distribution module samples the data using a binomial-distribution-based method and records the estimated distribution of each key in each Map task; sampling allows the overall distribution to be estimated at a small cost, from which the data distribution strategy is formulated.
2) The data distribution module performs the initial distribution of data in the form of a recorded partition list:
2-1. The sampled data distribution is sorted by key, the data are evenly allocated to the tasks, and the starting and ending keys of each task's pending data are recorded to form the partition list BK.
2-2. A key not in BK can find the partition it will be assigned to by comparison, so only the detailed distribution strategy of the keys in BK needs to be recorded.
2-3. After BK is determined, the task scheduling module is used to obtain the scheduling result in preparation for the next step of data distribution.
3) The data distribution module then redistributes the data:
3-1. Following the data-locality principle, for the to-be-allocated portion of a key in BK, it first allocates to the target Reduce task running on the same server as the Map task.
3-2. The remaining portion is allocated by the first-fit algorithm; at each allocation, the data distribution module cuts the data according to the configured time upper bound t_max and the performance prediction module, ensuring that no Reduce task takes longer than t_max.
3-3. If all Reduce tasks have been assigned but data remain, t_max is increased and the distribution restarts; this iterates until all keys in BK have been allocated.
The task scheduling module first schedules Map tasks and then schedules Reduce tasks, comprising the following steps:
1) Determine the upper bound on the number of tasks each server can be assigned; the bound is jointly determined by the server's own performance indicator h and the total number of tasks, with each server's share of tasks proportional to its h.
2) Determine the target server of each task. Scheduling first follows the data-locality principle, preferentially scheduling a task onto the server holding its pending data; if the target server's pre-assigned tasks have reached the bound, the task waits for the next round of scheduling. The second round of scheduling uses the first-fit algorithm, assigning the not-yet-scheduled tasks to servers that have not reached the bound.
3) Allocate computing-resource units: after a task's target server is determined, specific computing-resource units are allocated to the task.
Beneficial effects
Most existing methods do not apply to Spark, currently the most popular distributed computing framework. To make up for their deficiencies, the present invention proposes a data processing acceleration method for the Spark system. Compared with existing work, the present invention considers both the hardware computing-resource level and the application level: based on the degree to which hardware heterogeneity, computational interference, data locality, data skew, and similar factors affect task performance, it establishes a task-performance prediction model, proposes a model-solving algorithm that completes within seconds, and accordingly formulates data distribution and task scheduling strategies that minimize the overall running time and accelerate tasks.
Compared with the prior art, the present invention comprehensively considers the hardware-resource-level and application-level factors that affect distributed computing tasks, makes up for the deficiencies of the related technologies, and can be used to accelerate data processing of Spark systems in various scenarios.
Brief description of the drawings
Fig. 1 is the system architecture diagram of the invention.
Fig. 2 is a schematic diagram of the initial data distribution.
Fig. 3 is a schematic diagram of task scheduling.
Fig. 4 is a schematic diagram of the data redistribution.
Specific embodiment
To make the methods and techniques introduced by the present invention, as well as its purposes and effects, easier to understand, the present invention is further described below with reference to specific embodiments and the accompanying drawings.
As shown in Fig. 1, the present invention provides a data processing acceleration method for the Spark system, composed of three parts: a performance prediction module, a task scheduling module, and a data distribution module.
1. Performance prediction module
The performance prediction module models the performance of a task from given parameters and predicts its completion time. It is called by the other modules and gives feedback on scheduling schemes, which are continuously adjusted according to the principle of minimizing the overall completion time. The predicted time consists mainly of the following parts:
1) Data transfer time, i.e. the time a task needs to read the data it will process. Because the time spent reading local data (data stored on the same server where the task runs) is essentially negligible, the data transfer time is essentially the time t_s the task spends reading the portion of its data stored on other servers: t_s = (s_s * s_e) / b_s, where s_s is the number of records that must be read remotely, s_e is the size of each record, and b_s is the data transfer speed. This part of the time consumption reflects the influence of data locality on performance.
1-1) The remote read count s_s is determined by the data distribution strategy; this module feeds back to the data distribution module to help adjust the amount of data each task will read remotely.
1-2) The record size s_e is determined by the stored data type and the amount of data each record contains.
1-3) The data transfer speed b_s varies with the degree of data locality. A task reading local data incurs essentially no time cost; at the next level, when the task and its data sit on different servers within the same rack, the transfer speed depends on the switch; transfer across different racks performs worst.
2) Task computation time, i.e. the CPU running time. The time t_c a task needs to process its data is determined by the data size and the processing speed, and the processing speed is mainly affected by factors such as the server's own performance and computational interference, so t_c = ξ / (h * f), where ξ is the amount of data the task must process, h is the server's data processing speed in the interference-free case, and f is the interference coefficient under the given conditions. This part of the time consumption combines the hardware-heterogeneity and computational-interference factors.
2-1) The pending data amount ξ is determined by the data distribution strategy; this module feeds back to the data distribution module to help adjust the amount of data each task will process.
2-2) The interference-free data processing speed h of a server reflects its performance; servers with different hardware resources have different computing capabilities, obtained through benchmark testing. This parameter can be used to address the straggler problem in heterogeneous environments.
2-3) Computational interference refers to the application performance degradation caused by resource contention when multiple tasks run on one machine. Whether tasks belong to different processes or run in different virtual machines, they compete for hardware resources by preempting them, and complete isolation is currently impossible, so interference is unavoidable; establishing an interference model allows performance under multi-task parallel execution to be better assessed. Based on actual experimental results, this module establishes the interference model f = c0 + c1 * S/D, where c0 and c1 are constant coefficients of the model, and S and D are respectively the number of CPUs the server can supply and the number the tasks demand. A smaller f means more severe interference.
2. Task scheduling module
The default task scheduling mechanism of the MapReduce framework, which assigns computing tasks to servers, gives priority to satisfying task data locality, but this easily concentrates many tasks on particular servers, aggravating resource contention and degrading performance. At the same time, this mechanism does not consider hardware heterogeneity, which worsens the straggler problem in complex computing-cluster environments. This module jointly considers hardware heterogeneity and data locality to formulate the scheduling strategy for a task: it obtains the currently available computing resources in real time through a resource monitor and then dispatches the task to a designated server through a purpose-built program interface, realizing more flexible scheduling and cooperating with the other modules to mitigate stragglers. The specific steps are as follows:
1) Determine the upper bound on the number of tasks each server can be assigned. The bound is jointly determined by the server's own performance indicator h and the total number of tasks, with each server's share of tasks proportional to its h; that is, the better the performance, the more tasks can be assigned. This reduces the straggler problem caused by overloading low-performance servers.
2) Determine the target server of each task. Scheduling first follows the data-locality principle, preferentially scheduling a task onto the server holding its pending data; if the target server's pre-assigned tasks have reached the bound, the task waits for the next round of scheduling. The second round of scheduling uses the first-fit algorithm, assigning the not-yet-scheduled tasks to servers that have not reached the bound.
3) Allocate computing-resource units. After a task's target server is determined, specific computing-resource units must be allocated to it, such as a container in the YARN resource scheduling framework, which encapsulates a certain amount of CPU, memory, disk, and other resources for the running task to use.
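The two scheduling rounds above, per-server caps proportional to h, locality first, then first-fit, can be sketched as follows (a minimal illustration under assumed names, not the patent's implementation):

```python
def schedule(tasks, servers, h, local_server):
    """tasks: list of task ids; servers: list of server ids;
    h[s]: performance indicator of server s;
    local_server[t]: server holding task t's pending data."""
    total_h = sum(h[s] for s in servers)
    # Per-server task cap, proportional to the performance indicator h.
    cap = {s: round(len(tasks) * h[s] / total_h) for s in servers}
    load = {s: 0 for s in servers}
    placement, deferred = {}, []

    # Round 1: data locality first.
    for t in tasks:
        s = local_server[t]
        if load[s] < cap[s]:
            placement[t] = s
            load[s] += 1
        else:
            deferred.append(t)   # cap reached: wait for the next round

    # Round 2: first-fit onto any server still below its cap.
    for t in deferred:
        for s in servers:
            if load[s] < cap[s]:
                placement[t] = s
                load[s] += 1
                break
    return placement
```

For example, with servers a and b of performance 2 and 1 and three tasks whose data all sit on a, two tasks stay local on a and the third overflows to b by first-fit.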
3. Data distribution module
The shuffle phase of the MapReduce framework is the process of distributing the intermediate data in (key, value) key-value format generated by Map tasks to different Reduce tasks according to certain rules. By default, each record's target Reduce task is determined by the hash of its key modulo the total number of tasks, but when this mechanism handles data whose key frequencies differ greatly, too much data may be distributed to the same task, i.e. the data skew problem, which is one of the important sources of task performance degradation. Based on the performance prediction module and combined with the scheduling result, this module jointly considers the factors affecting performance, continuously adjusts the data distribution strategy, and finally minimizes the total running time of the Reduce tasks, accelerating data processing. The specific steps are as follows:
1) Data sampling. Sampling allows the overall distribution to be estimated at a small cost, from which the data distribution strategy is formulated. This module uses a binomial-distribution-based sampling method: each draw is an independent Bernoulli trial, i.e. whether each record is drawn is independent with constant probability. To support finer-grained data distribution and the needs of the data transfer factors, this module records the estimated distribution of each key in each Map task.
2) Initial data distribution. To avoid the heavy cost of recording a detailed distribution strategy for every key, this module uses the form of a recorded partition list: the sampled data distribution is first sorted by key, the data are then evenly allocated to the tasks, and finally the starting and ending keys of each task's pending data are recorded to form the partition list BK. Under this mechanism, a key not in BK can find the partition it will be assigned to by comparison, so only the detailed distribution strategy of the keys in BK needs to be recorded. After BK is determined, the task scheduling module is used to obtain the scheduling result in preparation for the next step of data distribution.
3) Data redistribution. This part formulates in detail the distribution strategy for the portion of each key in BK within each Map task. For the to-be-allocated portion of a key in BK, following the data-locality principle, the module first allocates to the target Reduce task running on the same server as the Map task, and allocates the remaining portion by the first-fit algorithm. At each allocation, the present invention cuts the data according to the configured time upper bound t_max and the performance prediction module, ensuring that no Reduce task takes longer than t_max. If all Reduce tasks have been assigned but data remain, t_max is increased and the distribution restarts; this iterates until all keys in BK have been allocated.
The detailed steps are as follows:
1. Preparation
The parameters of the performance prediction module must be obtained before the invention runs:
1) Run benchmarks to obtain the data processing performance h of each server in the computing cluster.
2) Vary the number of tasks running concurrently on a server to simulate different degrees of computational interference, obtain multiple groups of h, and fit the interference model f = c0 + c1 * S/D with an algorithm such as least squares to obtain the parameters c0 and c1.
3) Test to obtain the data transfer speed b_s between servers in the computing cluster.
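Preparation step 2) fits f = c0 + c1 * S/D from measured points; a minimal sketch using the closed-form least-squares line fit follows. The measurement values below are synthetic, for illustration only.

```python
def fit_interference(xs, ys):
    """Least-squares fit of f = c0 + c1 * x, where x = S/D.
    xs: measured S/D ratios; ys: measured interference coefficients f."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Closed-form simple linear regression: slope then intercept.
    c1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    c0 = my - c1 * mx
    return c0, c1

ratios = [0.25, 0.5, 1.0, 2.0]            # S/D under varying task counts
f_measured = [0.4 + 0.3 * x for x in ratios]  # noiseless synthetic data
c0, c1 = fit_interference(ratios, f_measured)
```

On the noiseless synthetic points the fit recovers c0 = 0.4 and c1 = 0.3 exactly; with real measurements the residuals reflect measurement noise.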
2. Data sampling and determination of the partition list
Before the Map tasks start, the pending data are sampled, the distribution of keys is counted, and from it the distribution {u_k} of the intermediate results the Map tasks will output is estimated, where u_k is the expected amount of data with key k; at the same time, each key's detailed distribution across the Map tasks to be executed is marked.
After the statistics are complete, as shown in Fig. 2, the elements of {u_k} are sorted by key and the data are allocated sequentially to the Reduce tasks, ensuring that each task's computation amount is the same (the computation amount normally equals the data volume), producing the list BK that records the keys at task partition boundaries. Keys such as k2 and k4 carry large data volumes of their own; to avoid the data skew problem, they must be cut and distributed to several adjacent Reduce tasks for processing.
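The boundary-list construction described above can be sketched as follows. This is an illustration under assumed names: boundary keys such as k2 and k4 end up in BK because their data straddle partition boundaries, and the detailed cutting of those keys across tasks is left out here.

```python
def build_bk(u, num_reduce):
    """u: dict key -> expected record count u_k;
    returns BK, the keys sitting on partition boundaries."""
    items = sorted(u.items())                  # sort {u_k} by key
    per_task = sum(u.values()) / num_reduce    # equal share per Reduce task
    bk, acc = [], 0.0
    for key, count in items:
        acc += count
        # Each time the running total crosses a share boundary, the
        # current key is a boundary key (it may need to be cut).
        while acc >= per_task and len(bk) < num_reduce - 1:
            bk.append(key)
            acc -= per_task
    return bk
```

For instance, with counts {k1: 10, k2: 30, k3: 10, k4: 30, k5: 20} and four Reduce tasks (25 records each), the large keys k2 and k4 cross boundaries and join BK.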
3. Initial data distribution
Comparison against the boundary keys in BK determines the unique target Reduce task of any key not in BK; for example, a key lying between the boundary keys k2 and k4 can easily be seen to belong only to the corresponding intermediate Reduce task (Reduce2 in Fig. 2). After the data u_k are assigned to their corresponding Reduce tasks, the performance prediction module can compute the expected time of each Reduce task after the initial distribution, and the current data distribution of each task is recorded, preparing for the next step of task scheduling and for the statistics of each task's remote read volume s_s.
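Routing a key that is not in BK reduces to a comparison against the sorted boundary keys, which a binary search captures. A sketch under assumed names:

```python
import bisect

def target_reduce(key, bk):
    """bk: sorted boundary keys forming the partition list;
    returns the index of the unique target Reduce task for a
    key that is not itself a boundary key."""
    return bisect.bisect_left(bk, key)
```

With bk = ['k2', 'k3', 'k4'], a key below k2 routes to task 0 and a key above k4 to task 3; only the boundary keys themselves need a recorded detailed strategy.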
4. Task scheduling
Map tasks are scheduled first, then Reduce tasks.
1) Data locality. As shown in Fig. 3, the input of the first Map tasks comes from the HDFS distributed file system, where files are stored in the form of multiple file blocks. Each Map task processes one block, each block has identical backups on several servers, and running a Map task on a server holding one of the corresponding backups satisfies the highest level of data locality. When a Map task completes, its output intermediate data will be sent to each Reduce task; the server that minimizes the amount of data a Reduce task must read remotely satisfies that task's maximum data locality, so the present invention uses the partial data distribution obtained after the initial distribution to determine the preferred node of each Reduce task.
2) Task threshold. The number of tasks that ought to be assigned to each server is computed from the ratio of the performance h of the available servers; that is, better-performing servers host more tasks.
3) Task dispatch. As shown in Fig. 3, each task i is first preferentially assigned to run on the server satisfying its maximum data locality; if the preferred node has already been assigned a number of tasks reaching the threshold, task i waits for the next round of scheduling (e.g. Task3). After the first round, each remaining task i is assigned by the first-fit algorithm to run on the first server found that has not reached its threshold.
After task scheduling completes, the interference coefficient f of each server is known, as is, after the initial data distribution, the remote read volume s_s of each task.
5. Data redistribution
The process is shown in Fig. 4.
1) For any key k in BK, find the set of Map tasks containing it.
2) Traverse each Map task in the set; if a target Reduce task i can satisfy the highest data locality, i.e. runs on the same server, the intermediate results generated by this Map task are preferentially distributed to i. The maximum amount of data currently assignable to task i is computed from the performance prediction module and the preassigned t_max, and the Map task's data are then cut and distributed accordingly.
3) Traverse the Map tasks in the set that are not yet fully allocated, attempting in turn to distribute data to all non-local target Reduce tasks, cutting the data by the principle that no task's predicted time exceeds t_max.
4) If data of k remain after the two rounds of distribution, the current t_max is too small: the data are reset, t_max is increased, and the distribution restarts from step 1); this iterates until every k is fully distributed.
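The redistribution loop above, for a single key in BK, can be sketched as follows. All names are assumptions; `capacity(i, t_max)` stands in for the performance prediction module and returns how much more data Reduce task i can take without exceeding t_max.

```python
def redistribute(volume, reduce_tasks, local, capacity, t_max, grow=1.5):
    """volume: data amount of key k; local: set of Reduce tasks co-located
    with k's Map tasks; returns ({task: amount}, final t_max)."""
    while True:
        assigned, remaining = {}, volume
        # Round 1: co-located Reduce tasks first; Round 2: the rest,
        # in first-fit order.
        for i in list(local) + [i for i in reduce_tasks if i not in local]:
            if remaining <= 0:
                break
            cut = min(remaining, capacity(i, t_max))  # stay under t_max
            if cut > 0:
                assigned[i] = cut
                remaining -= cut
        if remaining <= 0:
            return assigned, t_max
        t_max *= grow   # bound too small: enlarge it and restart
```

Because capacity grows with t_max, the geometric growth of the bound guarantees the loop eventually places all of k's data, mirroring step 4).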
It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the inventive concept, and these all fall within the scope of protection of the present invention. Therefore, the scope of patent protection of the present invention should be determined by the appended claims.
Claims (2)
1. A data processing acceleration method for the Spark system, composed of three parts, a performance prediction module, a task scheduling module and a data distribution module, characterized in that:
the performance prediction module models the performance of a task according to given parameters and predicts its completion time;
the task scheduling module distributes computing tasks to servers for execution, i.e. it specifies the task scheduling strategy from the factors of hardware heterogeneity and data locality, obtains the currently available computing resources in real time through a resource monitor, and then distributes tasks for execution on the given servers through a developed program interface;
the data distribution module performs the process of distributing the intermediate data in (key, value) key-value pair format generated by tasks to different Reduce tasks for processing according to certain rules; wherein:
1) the data distribution module samples the data based on a binomial distribution and records the estimated distribution of each key in each Map task; by sampling and analysis, the overall distribution of the data can be estimated at a small cost, and the data distribution strategy is then formulated on this basis;
2) the data distribution module distributes data in the form of a recorded partition list;
2-1. the data distribution obtained from the sampling analysis is sorted by key, the keys are then evenly distributed to the tasks, and the starting key and terminating key of the data to be processed by each task are recorded to form the partition list BK;
2-2. a key that does not belong to BK can find the partition it will be assigned to by comparison, so only the detailed distribution strategy of the keys in BK needs to be recorded and saved;
2-3. after BK is determined, the scheduling result is obtained through the task scheduling module in preparation for the next step of distributing the data;
3) the data distribution module distributes the data;
3-1. according to the data locality principle, the part of a key in BK that is to be allocated is first distributed to the target Reduce tasks running on the same servers as the Map tasks that produced it;
3-2. the remaining part is then allocated according to a first-fit algorithm; at each allocation, the data distribution module cuts the data according to the set time upper limit tmax and the performance prediction module, ensuring that the time used by each Reduce task does not exceed tmax;
3-3. if all Reduce tasks have been assigned and data still remains, tmax is increased and the distribution restarts, iterating in this way until all keys in BK have been assigned.
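The partition-list construction of steps 2-1 and 2-2 in claim 1 can be illustrated as follows. This is a minimal Python sketch under assumptions: the sampled distribution is given as a flat list of keys, and the helper names (build_partition_list, partition_of) are hypothetical, not from the patent.

```python
import bisect
from collections import Counter

def build_partition_list(sampled_keys, n_reduce):
    """Sort the sampled key distribution, split it into n_reduce even
    shares, and record each share's (starting key, terminating key)
    to form the partition list BK (step 2-1)."""
    counts = Counter(sampled_keys)          # estimated frequency of each key
    keys = sorted(counts)
    total = sum(counts.values())
    bk, start, acc, cut = [], keys[0], 0, total / n_reduce
    for k in keys:
        acc += counts[k]
        if acc >= cut and len(bk) < n_reduce - 1:
            bk.append((start, k))           # close the current partition at k
            start = None                    # next key opens the next partition
            cut += total / n_reduce
        elif start is None:
            start = k
    bk.append((start if start is not None else keys[-1], keys[-1]))
    return bk

def partition_of(key, bk):
    """A key not stored explicitly finds its partition by comparison
    against the terminating keys in BK (step 2-2)."""
    ends = [end for _, end in bk]
    return bisect.bisect_left(ends, key)
```

For a uniform sample of six keys split three ways, each partition covers two keys, and any key, sampled or not, locates its partition by binary search over the terminating keys.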
2. The data processing acceleration method for the Spark system according to claim 1, characterized in that the task scheduling module schedules the Map tasks first and then the Reduce tasks, comprising the following steps:
1) determining, for each server, an upper limit on the number of tasks that can be distributed to it, the upper limit being jointly determined by the server's own performance index h and the total number of tasks, the task quantities of the servers being in the ratio of their h values;
2) scheduling: the target server of each task is first determined according to the data locality principle, each task being preferentially scheduled to the server where its data to be processed resides; if the tasks pre-assigned to the target server have reached the upper limit, the task waits for the next round of scheduling; the second round of scheduling uses a first-fit algorithm, assigning not-yet-scheduled tasks to servers whose upper limits have not yet been reached;
3) allocating computing resource units: after the target server of a task is determined, a specific computing resource unit needs to be allocated to the task.
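The two-round scheduling of steps 1) and 2) in claim 2 can be sketched as follows. This is an illustrative Python sketch under assumptions: the inputs (data_server, the performance index dict h) and the tie-breaking fallback for rounding shortfalls are choices of this sketch, not details given by the patent.

```python
def schedule(tasks, data_server, h):
    """Two-round scheduling: per-server task caps proportional to the
    performance index h, data locality first, then first fit."""
    total = len(tasks)
    weight = sum(h.values())
    # step 1: upper limit per server, in the ratio of the h values
    cap = {s: max(1, round(total * h[s] / weight)) for s in h}
    assigned, count, pending = {}, {s: 0 for s in h}, []
    # round 1: schedule each task to the server holding its data
    for t in tasks:
        s = data_server[t]
        if count[s] < cap[s]:
            assigned[t], count[s] = s, count[s] + 1
        else:
            pending.append(t)              # wait for the next round
    # round 2: first fit on any server below its upper limit
    for t in pending:
        for s in h:
            if count[s] < cap[s]:
                assigned[t], count[s] = s, count[s] + 1
                break
        else:                              # rounding left no capacity: least loaded
            s = min(h, key=lambda x: count[x] / cap[x])
            assigned[t], count[s] = s, count[s] + 1
    return assigned
```

With h = {s1: 2, s2: 1} and three tasks, s1's cap is 2 and s2's is 1; a task whose data sits on a full s2 spills over to s1 in the second round.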
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910467553.0A CN110262896A (en) | 2019-05-31 | 2019-05-31 | A kind of data processing accelerated method towards Spark system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110262896A true CN110262896A (en) | 2019-09-20 |
Family
ID=67916087
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910467553.0A Pending CN110262896A (en) | 2019-05-31 | 2019-05-31 | A kind of data processing accelerated method towards Spark system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110262896A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113094155A (en) * | 2019-12-23 | 2021-07-09 | 中国移动通信集团辽宁有限公司 | Task scheduling method and device under Hadoop platform |
CN114385336A (en) * | 2021-12-27 | 2022-04-22 | 同济大学 | Anti-interference scheduling method and device for flow big data processing task |
CN114880272A (en) * | 2022-03-31 | 2022-08-09 | 深圳清华大学研究院 | Optimization method and application of global height degree vertex set communication |
CN114880272B (en) * | 2022-03-31 | 2024-06-07 | 深圳清华大学研究院 | Optimization method and application of global height degree vertex set communication |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105808334A (en) * | 2016-03-04 | 2016-07-27 | 山东大学 | MapReduce short job optimization system and method based on resource reuse |
CN108572873A (en) * | 2018-04-24 | 2018-09-25 | 中国科学院重庆绿色智能技术研究院 | A kind of load-balancing method and device solving the problems, such as Spark data skews |
CN109376012A (en) * | 2018-10-10 | 2019-02-22 | 电子科技大学 | A kind of self-adapting task scheduling method based on Spark for isomerous environment |
Non-Patent Citations (1)
Title |
---|
LI Qiaoqiao (李巧巧): "Research on Spark Task Partitioning and Scheduling Strategies for Load Balancing" (面向负载均衡的Spark任务划分与调度策略研究), Information Science and Technology Series (《信息科技辑》) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107888669B (en) | Deep learning neural network-based large-scale resource scheduling system and method | |
CN107038069B (en) | Dynamic label matching DLMS scheduling method under Hadoop platform | |
Chang et al. | Scheduling in mapreduce-like systems for fast completion time | |
US9471390B2 (en) | Scheduling mapreduce jobs in a cluster of dynamically available servers | |
US20200219028A1 (en) | Systems, methods, and media for distributing database queries across a metered virtual network | |
CN111381950A (en) | Task scheduling method and system based on multiple copies for edge computing environment | |
CN110737529A (en) | cluster scheduling adaptive configuration method for short-time multiple variable-size data jobs | |
CN103729246B (en) | Method and device for dispatching tasks | |
CN107357652B (en) | Cloud computing task scheduling method based on segmentation ordering and standard deviation adjustment factor | |
CN104657220A (en) | Model and method for scheduling for mixed cloud based on deadline and cost constraints | |
CN108469988A (en) | A kind of method for scheduling task based on isomery Hadoop clusters | |
CN103699433B (en) | One kind dynamically adjusts number of tasks purpose method and system in Hadoop platform | |
CN103401939A (en) | Load balancing method adopting mixing scheduling strategy | |
CN112685153A (en) | Micro-service scheduling method and device and electronic equipment | |
CN108845874A (en) | The dynamic allocation method and server of resource | |
CN102207883A (en) | Transaction scheduling method of heterogeneous distributed real-time system | |
US20130268941A1 (en) | Determining an allocation of resources to assign to jobs of a program | |
Pongsakorn et al. | Container rebalancing: Towards proactive linux containers placement optimization in a data center | |
CN107766140B (en) | Schedulability analysis method for real-time task with preemption point | |
US6278963B1 (en) | System architecture for distribution of discrete-event simulations | |
CN106991006A (en) | Support the cloud workflow task clustering method relied on and the time balances | |
CN112015765B (en) | Spark cache elimination method and system based on cache value | |
CN109710372A (en) | A kind of computation-intensive cloud workflow schedule method based on cat owl searching algorithm | |
CN110262896A (en) | A kind of data processing accelerated method towards Spark system | |
Hu et al. | Minimizing resource consumption cost of DAG applications with reliability requirement on heterogeneous processor systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190920 |