A method for handling imbalanced data distribution on the Spark big data platform
Technical field
The present invention relates to the field of online cluster resource scheduling techniques, and more particularly to a method for handling imbalanced data distribution on the Spark big data platform.
Background art
Spark is an in-memory computing framework that performs distributed processing of massive data in a reliable, efficient, and scalable manner. The main components of a Spark cluster deployment are the Spark Client, the SparkContext, the ClusterManager, the Workers, and the Executors, as shown in Figure 1. The Spark Client is used by the user to submit an application to the Spark cluster, while the SparkContext is responsible for communicating with the ClusterManager, applying for resources, and distributing and monitoring tasks, i.e., for managing the life cycle of job execution. The ClusterManager provides the allocation and management of resources, and the role it plays differs under different operating modes: under modes such as Local and Spark Standalone it is filled by the Master, under the YARN mode by the ResourceManager, and under the Mesos mode by the Mesos Manager. After the SparkContext has divided the operations of a job and allocated resources, tasks are sent to the Executors on the Worker nodes to run.
The programming model of Spark is the Resilient Distributed Dataset (RDD), an extension of the MapReduce model that enables efficient data sharing across parallel computation stages. A Spark application mainly consists of a sequence of operations on RDDs; these operators fall broadly into transformation operations and action operations. Transformations are lazily executed, and only the occurrence of an action triggers the submission of a job.
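For illustration, the lazy-transformation/eager-action distinction described above can be mimicked in a few lines of plain Python. This is a toy model, not Spark's actual implementation; the class and method names below are invented for the sketch.

```python
# Toy illustration of Spark's lazy-evaluation model (not Spark code):
# transformations only record the operation; an action triggers computation.
class ToyRDD:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []           # recorded operations, not yet executed

    def map(self, f):                  # "transformation": lazy, returns a new ToyRDD
        return ToyRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):               # another lazy transformation
        return ToyRDD(self.data, self.ops + [("filter", p)])

    def collect(self):                 # "action": triggers execution of the whole chain
        out = list(self.data)
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has run yet; only collect() (the action) computes the result.
print(rdd.collect())  # [0, 4, 16]
```

In real Spark, the chain of recorded transformations corresponds to the RDD lineage from which the DAG is built.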
The two most important schedulers in Spark job scheduling are the DAGScheduler and the TaskScheduler. The DAGScheduler is responsible for the logical scheduling of tasks; it is a high-level, stage-oriented scheduler that splits a job into task sets (TaskSets) of mutually independent stages, while the TaskScheduler is responsible for the scheduling and execution of the individual tasks. A Spark application constructs a Directed Acyclic Graph (DAG) according to the dependencies between RDDs; the DAGScheduler parses the DAG and splits it into interdependent scheduling stages. Each scheduling stage contains one or more tasks, which form a TaskSet that the DAGScheduler submits to the low-level TaskScheduler for execution. The DAGScheduler also monitors the progress of the scheduling stages; if a scheduling stage fails, it must be resubmitted.
The TaskScheduler receives the task sets sent from the DAGScheduler and is responsible for distributing the tasks one by one to the Executors on the Worker nodes of the Spark cluster to run. If a task fails, the TaskScheduler is responsible for retrying it. If the TaskScheduler finds that a task never finishes running, it may start another copy of the same task and use the result of whichever copy finishes first. After the Executor on a Worker node receives the tasks sent by the TaskScheduler, it runs them in a multithreaded manner, each thread being responsible for one task; when a task finishes running, its result is returned to the TaskScheduler.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for handling imbalanced data distribution on the Spark big data platform, which can reduce the degree of data skew among tasks and accelerate task execution.
In order to solve the above technical problem, in a first aspect, an embodiment of the present invention provides a method for handling imbalanced data distribution on the Spark big data platform, the method comprising the following four major steps:
(1) setting some tasks in each Stage of Spark as active tasks;
(2) estimating the remaining runtime of each task according to the metadata of the active tasks;
(3) arranging the active tasks in descending order of remaining runtime, and reassigning the available data blocks of the task with the largest remaining runtime to the task with the smallest remaining runtime;
(4) updating the remaining runtimes of the tasks, and repeating step (3) until there are no available data blocks to be scheduled or the remaining runtime of the redistributed task is less than the maximum remaining runtime of the other tasks.
According to the first aspect, in a first possible implementation, some tasks in each scheduling stage of Spark are set as active tasks; the number of active tasks is m, with m ≤ min(s, n), where s is the total number of CPU cores in the cluster and n is the number of partitions of the RDD or the total number of tasks in the scheduling stage. That is, the number of active tasks does not exceed the minimum of the number of CPU cores in the cluster and the number of RDD partitions or scheduling-stage tasks. An active task cannot stop running of its own accord when its execution is about to end; it stops only when it receives a stop-order message. This is done so that the unprocessed data blocks of the longest-running task can be assigned to other tasks for execution.
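As a non-limiting illustration, the bound on the number of active tasks can be sketched as follows; the function and parameter names are invented for the example.

```python
# Sketch of the active-task bound m <= min(s, n) from the first implementation.
def max_active_tasks(total_cpu_cores: int, num_partitions: int) -> int:
    """Active tasks never exceed min(cluster CPU cores, RDD partitions /
    stage task count), so every active task can occupy a core."""
    return min(total_cpu_cores, num_partitions)

# e.g. a 16-core cluster running a stage with 200 partitions
print(max_active_tasks(16, 200))  # 16
```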
According to the first aspect, in a second possible implementation, at most one task is selected for redistribution at a time, because splitting the data blocks of one task and transferring them to other tasks occupies the idle CPU resources released by another task that has already completed; moreover, redistributing two tasks at once is no more effective than redistributing a single task, and redistributing one task allows that task to make full use of the remaining resources. The MRFair Master estimates the remaining execution time of all tasks and, upon detection, selects the task with the longest remaining time for redistribution. The Master ensures that the remaining time is sufficient to compute a scheduling plan; if MRFair decides to redistribute the available data blocks of a task T, the following two conditions must be met:
(1) at least one CPU core in the system is currently idle;
(2) redistribution is worthwhile only if the original execution time of task T is greater than the execution time after redistribution plus the extra overhead the redistribution requires.
For long-running tasks the influence of data skew is severe, while the extra overhead of redistributing a task is essentially negligible.
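The two preconditions above can be expressed as a simple predicate; this is a hedged sketch, and the function and parameter names are assumptions made for illustration.

```python
# Sketch of the two preconditions for redistributing task T's blocks:
# (1) an idle CPU core exists, and (2) the time saved exceeds the overhead.
def should_redistribute(idle_cores: int,
                        original_runtime: float,
                        runtime_after_split: float,
                        split_overhead: float) -> bool:
    if idle_cores < 1:                      # condition (1): a free core to run on
        return False
    # condition (2): splitting must actually shorten the task, overhead included
    return original_runtime > runtime_after_split + split_overhead

print(should_redistribute(2, 120.0, 70.0, 5.0))   # True: 120 > 70 + 5
print(should_redistribute(0, 120.0, 70.0, 5.0))   # False: no idle core
```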
According to the first aspect, in a third possible implementation, after the user submits a job, as long as unscheduled tasks exist at startup, MRFair calls the traditional ScheduleBackend module to schedule and distribute tasks. Once the Master node has scheduled and distributed all tasks, MRFair is activated by the detection module in the MRFair Worker and performs skew detection on the currently executing tasks based on their estimated remaining-runtime values, as shown in Figure 2. If the remaining-runtime values of the tasks differ significantly (i.e., splitting the remaining task is worthwhile), this information is reported to the MRFair Master, and the task ID together with its estimated remaining-runtime value is stored in a hash list on the MRFair Master node to await subsequent processing.
According to the first aspect, in a fourth possible implementation, when the Master node arranges the active tasks in descending order of remaining runtime, the task T1 with the longest remaining runtime must be suspended: the MRFair Master notifies the MRFair Worker to pause the execution of that task and to record the position of the last input data it processed, so that the previously processed input data can be skipped. If the task T1 with the longest remaining runtime is in a state in which it cannot, or can only with difficulty, be stopped, the request fails, and the Master node either reselects the task T2 with the second-longest remaining runtime and applies the above processing to it, or, if T1 is the last task in the job, re-executes T1 on its entire input partition; re-executing the task T1 with the longest remaining runtime on its entire input is then exactly like the speculative execution strategy of Spark.
According to the first aspect, in a fifth possible implementation, data blocks rather than bytes are chosen as the unit of data redistribution, because scanning and splitting the input data of a task byte by byte would cause the Executor to block for a long time, which cannot be tolerated. In order to maintain the metadata of data blocks at low cost, MRFair attempts to schedule data blocks fairly, even though the data blocks are of unequal size. To this end, the Master collects the metadata of all tasks and their data blocks, stores the data locally, and then decides, based on a runtime-driven strategy, which task is scheduled first. A data block is defined to have the following five states:
(1) LocalFetched: the data block is located on the local node;
(2) RemoteFetchWaiting: the data block is located on a remote node, and the request to fetch the data block is waiting to be sent;
(3) RemoteFetching: the data block is located on a remote node, and the request to fetch the data block has been issued;
(4) RemoteFetched: the data block is located on a remote node, and the data block has been fetched to the local node;
(5) Used: the data block has been consumed.
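For illustration, the five block states can be modelled as an enumeration; the Python constant names below are paraphrases of the state names in the text, not MRFair's actual identifiers.

```python
# The five data-block states as an enum; a block's state drives both
# fetching and the remaining-runtime estimate (sketch only).
from enum import Enum, auto

class BlockState(Enum):
    LOCAL_FETCHED = auto()          # (1) block is on the local node
    REMOTE_FETCH_WAITING = auto()   # (2) remote; fetch request not yet sent
    REMOTE_FETCHING = auto()        # (3) remote; fetch request in flight
    REMOTE_FETCHED = auto()         # (4) remote block already copied locally
    USED = auto()                   # (5) block has been consumed

print(len(BlockState))  # 5
```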
According to the first aspect, in a sixth possible implementation, the estimation of a task's remaining runtime is an important factor in deciding how to redistribute available data blocks. If the state of a data block is LocalFetched or RemoteFetched, its remaining runtime is its size divided by the computation speed of the data block on the local Executor. If the state of a data block is RemoteFetchWaiting or RemoteFetching, then, if the data block is on the local Executor, its remaining runtime is its size divided by its computation speed on the local Executor; otherwise, its remaining runtime is its size divided by its computation speed on the local Executor, plus its size divided by the download speed from the remote Executor to the local Executor. The remaining runtime of a data block in any other state is 0. From the metadata of the data blocks a task holds, the remaining runtime of the entire task can be derived.
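The per-block estimates above can be sketched directly; summing the block estimates to get the task's remaining runtime is one plausible reading of "can be derived", and all names and units (bytes, bytes per second) here are assumptions for illustration.

```python
# Sketch of the per-block remaining-runtime estimate described above.
def block_remaining_time(size_bytes: float, state: str, is_local: bool,
                         compute_speed: float, download_speed: float) -> float:
    if state in ("LocalFetched", "RemoteFetched"):
        return size_bytes / compute_speed
    if state in ("RemoteFetchWaiting", "RemoteFetching"):
        if is_local:
            return size_bytes / compute_speed
        # remote block: computation time plus time to download it locally
        return size_bytes / compute_speed + size_bytes / download_speed
    return 0.0                                  # e.g. state "Used"

def task_remaining_time(blocks) -> float:
    """Task remaining runtime: sum of its blocks' estimates (assumption)."""
    return sum(block_remaining_time(*b) for b in blocks)

blocks = [(100e6, "LocalFetched", True, 50e6, 100e6),
          (100e6, "RemoteFetching", False, 50e6, 100e6),
          (100e6, "Used", True, 50e6, 100e6)]
print(task_remaining_time(blocks))  # 2.0 + 3.0 + 0.0 = 5.0
```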
According to the first aspect, in a seventh possible implementation, the task with the longest remaining runtime is redistributed first. The redistribution algorithm takes as input the metadata tasks of all active tasks and a remaining-runtime predictor θ, where the predictor θ estimates the processing time of a data block on a specific Executor from statistical data. Suppose there are m active tasks, with metadata tasks = {task1, task2, ..., taskm}. The redistribution algorithm first arranges the active tasks in descending order of remaining runtime, taskf > taski > ... > taskt (f, i, ..., t ∈ [1, m]), then selects the task taskf with the longest remaining runtime, filters out all of its unprocessed data blocks, and saves the result in a list blocks; if there are n available data blocks, then blocks = {block1, block2, ..., blockn}. The algorithm traverses all available data blocks and passes each data block to the task taskt that completes first; it finally updates the hash table of task remaining runtimes, re-sorts the tasks list, and removes the scheduled data blocks from the blocks list. This process is repeated until one of the following two conditions is met:
(1) blocks = Φ;
(2) T(taskt) < max{T(taski) | i ∈ [1, m], i ≠ t}, where T(·) denotes the remaining runtime;
that is, until there are no available data blocks to be scheduled or the remaining runtime of the redistributed task is less than the maximum remaining runtime of the other tasks.
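The redistribution loop can be sketched as follows. The stopping condition is stated tersely in the text; this sketch interprets it as "stop when moving another block would push the receiving task's remaining time to or beyond the maximum remaining time of the other tasks". The predictor θ is stubbed as known per-block costs, and all data structures and names are simplifications invented for the example.

```python
# Runnable sketch of the redistribution loop of the seventh implementation.
# Each task is a dict with a name and a list of per-block costs (stub for θ).
def redistribute(tasks):
    """Move blocks from the longest-remaining task to the shortest until
    no blocks remain to move or further moves no longer help."""
    def remaining(t):
        return sum(t["blocks"])
    while True:
        tasks.sort(key=remaining, reverse=True)      # descending remaining time
        longest, shortest = tasks[0], tasks[-1]
        if not longest["blocks"]:                    # condition (1): blocks empty
            break
        blk = longest["blocks"].pop()                # move one unprocessed block
        shortest["blocks"].append(blk)               # to the earliest-finishing task
        others_max = max(remaining(t) for t in tasks if t is not shortest)
        if remaining(shortest) >= others_max:        # condition (2): no longer helps
            shortest["blocks"].remove(blk)           # undo the unhelpful move
            longest["blocks"].append(blk)
            break
    return tasks

ts = [{"name": "T2", "blocks": [3, 3, 3, 3]},   # skewed task, remaining 12
      {"name": "T3", "blocks": [2]},            # remaining 2
      {"name": "T4", "blocks": [4]}]            # remaining 4
redistribute(ts)
print({t["name"]: sum(t["blocks"]) for t in ts})
```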
In a second aspect, an embodiment of the present invention provides an apparatus for handling imbalanced data distribution on the Spark big data platform, comprising the scheduling apparatus described in the first aspect or in any possible implementation of the first aspect.
In a third aspect, an embodiment of the present invention provides a power-consumption reduction method for handling imbalanced data distribution on the Spark big data platform, characterized in that the Spark cluster system performs scheduling using the method described in the first aspect or in any possible implementation of the first aspect.
Brief description of the drawings
Fig. 1 is a system architecture diagram of a Spark cluster according to an embodiment of the present invention;
Fig. 2 is a flowchart of the MRFair system detecting skewed tasks according to an embodiment of the present invention;
Fig. 3 is a flowchart of the balanced data distribution method of the MRFair system according to an embodiment of the present invention;
Fig. 4 is an explanatory diagram of a sample according to an embodiment of the present invention.
Specific embodiment
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples are intended to illustrate the present invention, not to limit its scope.
As shown in Figure 3, an embodiment of the present invention provides a method for handling imbalanced data distribution on the Spark big data platform, the method comprising the steps of:
S101. setting some tasks in each Stage of Spark as active tasks;
S102. estimating the remaining runtime of each task according to the metadata of the active tasks;
S103. arranging the active tasks in descending order of remaining runtime, and reassigning the available data blocks of the task with the largest remaining runtime to the task with the smallest remaining runtime;
S104. updating the remaining runtimes of the tasks, and repeating step S103 until there are no available data blocks to be scheduled or the remaining runtime of the redistributed task is less than the maximum remaining runtime of the other tasks.
Those skilled in the art will understand that, in the methods of the various embodiments of the present invention, the magnitude of the serial number of each step does not imply an order of execution; the execution order of the steps should be determined by their function and internal logic, and does not constitute any limitation on the implementation of the specific embodiments of the present invention.
An embodiment of the present invention also provides a Spark cluster system that includes the scheduling apparatus of the embodiment shown in Fig. 3; the cluster system can be deployed according to the architecture shown in Fig. 1, and the scheduling apparatus may be the task scheduler shown in Fig. 1.
The various embodiments of the present invention are further illustrated below by a specific example:
The sample is detailed in Fig. 4, using the default Spark Standalone operating mode, in which the same stage of Spark has four tasks T1, T2, T3, T4, and the current cluster environment has four available resources Slot1, Slot2, Slot3, Slot4. Initially, the number of active tasks is set to 4. At time t1 (i.e., when Task T1 completes), the MRFair system estimates the remaining runtimes of the tasks and detects that Task T2 is the task with the longest remaining runtime and suffers from severe data skew; by repartitioning the unprocessed input data of T2, the influence of the skew is mitigated. In fact, MRFair not only repartitions the unprocessed data of T2 to Slot 1 and Slot 2 but also assigns it to Slot 3, which becomes an idle resource when Task T3 completes; MRFair again repartitions the unprocessed data associated with Task T2, thereby making the greatest possible use of resources, accelerating task execution, and reducing the total completion time. The redistributed tasks T2a, T2b, T2c are called the pre-split tasks of the original unprocessed task (pre-split tasks for short) and are scheduled by the longest-processing-time-first scheduling policy. MRFair cyclically executes the strategy of "estimate the remaining runtimes of the tasks, redistribute the unprocessed data blocks of the task with the longest remaining runtime" until all tasks have finished executing. At time t2, MRFair finds that T4 is the next task with the longest remaining runtime and reduces the influence of data skew by again repartitioning the remaining unprocessed data of task T4.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by instructing the relevant hardware through a computer program; the program may be stored in a computer-readable storage medium, and when executed may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The above description is merely of specific embodiments, but the scope of protection of the present invention is not limited thereto; any changes or substitutions that can readily be conceived by those familiar with the technical field, within the technical scope disclosed by the present invention, should be included within the scope of protection of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.