A method for handling imbalanced data distribution on the Spark big data platform
Technical field
The present invention relates to the field of online cluster resource scheduling techniques, and more particularly to a method for handling imbalanced data distribution on the Spark big data platform.
Background art
Spark is an in-memory computing framework that performs distributed processing of massive data in a reliable, efficient, and scalable manner. The main components of a Spark cluster deployment are the Spark Client, the SparkContext, the ClusterManager, the Workers, and the Executors, as shown in Figure 1. The Spark Client is used by the user to submit an application to the Spark cluster, while the SparkContext is responsible for communicating with the ClusterManager, applying for resources, and distributing and monitoring tasks, i.e., for managing the life cycle of job execution. The ClusterManager provides the allocation and management of resources, and the role it plays differs under different operating modes: under modes such as Local and Spark Standalone it is filled by the Master, under the YARN mode by the ResourceManager, and under the Mesos mode by the Mesos Manager. After the SparkContext has divided the operations of a job and allocated resources, tasks are sent to the Executors on the Worker nodes to run.
The programming model of Spark is the Resilient Distributed Dataset (RDD), an extension of the MapReduce model that enables efficient data sharing across parallel computation stages. A Spark application mainly consists of a sequence of operations on RDDs; these operators fall broadly into transformation operations and action operations. Transformations are lazily executed, and only the occurrence of an action triggers the submission of a job.
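For illustration, the lazy-transformation/eager-action distinction described above can be mimicked in a few lines of plain Python. This is a toy model, not Spark's actual implementation; the class and method names below are invented for the sketch.

```python
# Toy illustration of Spark's lazy-evaluation model (not Spark code):
# transformations only record the operation; an action triggers computation.
class ToyRDD:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []           # recorded operations, not yet executed

    def map(self, f):                  # "transformation": lazy, returns a new ToyRDD
        return ToyRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):               # another lazy transformation
        return ToyRDD(self.data, self.ops + [("filter", p)])

    def collect(self):                 # "action": triggers execution of the whole chain
        out = list(self.data)
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has run yet; only collect() (the action) computes the result.
print(rdd.collect())  # [0, 4, 16]
```

In real Spark, the chain of recorded transformations corresponds to the RDD lineage from which the DAG is built.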
The two most important schedulers in Spark job scheduling are the DAGScheduler and the TaskScheduler. The DAGScheduler is responsible for the logical scheduling of tasks; it is a high-level, stage-oriented scheduler that splits a job into task sets (TaskSets) of mutually independent stages, while the TaskScheduler is responsible for the scheduling and execution of the individual tasks. A Spark application constructs a Directed Acyclic Graph (DAG) according to the dependencies between RDDs; the DAGScheduler parses the DAG and splits it into interdependent scheduling stages. Each scheduling stage contains one or more tasks, which form a TaskSet that the DAGScheduler submits to the low-level TaskScheduler for execution. The DAGScheduler also monitors the progress of the scheduling stages; if a scheduling stage fails, it must be resubmitted.
The TaskScheduler receives the task sets sent from the DAGScheduler and is responsible for distributing the tasks one by one to the Executors on the Worker nodes of the Spark cluster to run. If a task fails, the TaskScheduler is responsible for retrying it. If the TaskScheduler finds that a task never finishes running, it may start another copy of the same task and use the result of whichever copy finishes first. After the Executor on a Worker node receives the tasks sent by the TaskScheduler, it runs them in a multithreaded manner, each thread being responsible for one task; when a task finishes running, its result is returned to the TaskScheduler.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for handling imbalanced data distribution on the Spark big data platform, which can reduce the degree of data skew among tasks and accelerate task execution.
In order to solve the above technical problem, in a first aspect, an embodiment of the present invention provides a method for handling imbalanced data distribution on the Spark big data platform, the method comprising the following four major steps:
(1) setting some tasks in each Stage of Spark as active tasks;
(2) estimating the remaining runtime of each task according to the metadata of the active tasks;
(3) arranging the active tasks in descending order of remaining runtime, and reassigning the available data blocks of the task with the largest remaining runtime to the task with the smallest remaining runtime;
(4) updating the remaining runtimes of the tasks, and repeating step (3) until there are no available data blocks to be scheduled or the remaining runtime of the redistributed task is less than the maximum remaining runtime of the other tasks.
According to the first aspect, in a first possible implementation, some tasks in each scheduling stage of Spark are set as active tasks; the number of active tasks is m, with m ≤ min(s, n), where s is the total number of CPU cores in the cluster and n is the number of partitions of the RDD or the total number of tasks in the scheduling stage. That is, the number of active tasks does not exceed the minimum of the number of CPU cores in the cluster and the number of RDD partitions or scheduling-stage tasks. An active task cannot stop running of its own accord when its execution is about to end; it stops only when it receives a stop-order message. This is done so that the unprocessed data blocks of the longest-running task can be assigned to other tasks for execution.
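As a non-limiting illustration, the bound on the number of active tasks can be sketched as follows; the function and parameter names are invented for the example.

```python
# Sketch of the active-task bound m <= min(s, n) from the first implementation.
def max_active_tasks(total_cpu_cores: int, num_partitions: int) -> int:
    """Active tasks never exceed min(cluster CPU cores, RDD partitions /
    stage task count), so every active task can occupy a core."""
    return min(total_cpu_cores, num_partitions)

# e.g. a 16-core cluster running a stage with 200 partitions
print(max_active_tasks(16, 200))  # 16
```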
According to the first aspect, in a second possible implementation, at most one task is selected for redistribution at a time, because splitting the data blocks of one task and transferring them to other tasks occupies the idle CPU resources released by another task that has already completed; moreover, redistributing two tasks at once is no more effective than redistributing a single task, and redistributing one task allows that task to make full use of the remaining resources. The MRFair Master estimates the remaining execution time of all tasks and, upon detection, selects the task with the longest remaining time for redistribution. The Master ensures that the remaining time is sufficient to compute a scheduling plan; if MRFair decides to redistribute the available data blocks of a task T, the following two conditions must be met:
(1) at least one CPU core in the system is currently idle;
(2) redistribution is worthwhile only if the original execution time of task T is greater than the execution time after redistribution plus the extra overhead the redistribution requires.
For long-running tasks the influence of data skew is severe, while the extra overhead of redistributing a task is essentially negligible.
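The two preconditions above can be expressed as a simple predicate; this is a hedged sketch, and the function and parameter names are assumptions made for illustration.

```python
# Sketch of the two preconditions for redistributing task T's blocks:
# (1) an idle CPU core exists, and (2) the time saved exceeds the overhead.
def should_redistribute(idle_cores: int,
                        original_runtime: float,
                        runtime_after_split: float,
                        split_overhead: float) -> bool:
    if idle_cores < 1:                      # condition (1): a free core to run on
        return False
    # condition (2): splitting must actually shorten the task, overhead included
    return original_runtime > runtime_after_split + split_overhead

print(should_redistribute(2, 120.0, 70.0, 5.0))   # True: 120 > 70 + 5
print(should_redistribute(0, 120.0, 70.0, 5.0))   # False: no idle core
```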
According to the first aspect, in a third possible implementation, after the user submits a job, as long as unscheduled tasks exist at startup, MRFair calls the traditional ScheduleBackend module to schedule and distribute tasks. Once the Master node has scheduled and distributed all tasks, MRFair is activated by the detection module in the MRFair Worker and performs skew detection on the currently executing tasks based on their estimated remaining-runtime values, as shown in Figure 2. If the remaining-runtime values of the tasks differ significantly (i.e., splitting the remaining task is worthwhile), this information is reported to the MRFair Master, and the task ID together with its estimated remaining-runtime value is stored in a hash list on the MRFair Master node to await subsequent processing.
According to the first aspect, in a fourth possible implementation, when the Master node arranges the active tasks in descending order of remaining runtime, the task T1 with the longest remaining runtime must be suspended: the MRFair Master notifies the MRFair Worker to pause the execution of that task and to record the position of the last input data it processed, so that the previously processed input data can be skipped. If the task T1 with the longest remaining runtime is in a state in which it cannot, or can only with difficulty, be stopped, the request fails, and the Master node either reselects the task T2 with the second-longest remaining runtime and applies the above processing to it, or, if T1 is the last task in the job, re-executes T1 on its entire input partition; re-executing the task T1 with the longest remaining runtime on its entire input is then exactly like the speculative execution strategy of Spark.
According to the first aspect, in a fifth possible implementation, data blocks rather than bytes are chosen as the unit of data redistribution, because scanning and splitting the input data of a task byte by byte would cause the Executor to block for a long time, which cannot be tolerated. In order to maintain the metadata of data blocks at low cost, MRFair attempts to schedule data blocks fairly, even though the data blocks are of unequal size. To this end, the Master collects the metadata of all tasks and their data blocks, stores the data locally, and then decides, based on a runtime-driven strategy, which task is scheduled first. A data block is defined to have the following five states:
(1) LocalFetched: the data block is located on the local node;
(2) RemoteFetchWaiting: the data block is located on a remote node, and the request to fetch the data block is waiting to be sent;
(3) RemoteFetching: the data block is located on a remote node, and the request to fetch the data block has been issued;
(4) RemoteFetched: the data block is located on a remote node, and the data block has been fetched to the local node;
(5) Used: the data block has been consumed.
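For illustration, the five block states can be modelled as an enumeration; the Python constant names below are paraphrases of the state names in the text, not MRFair's actual identifiers.

```python
# The five data-block states as an enum; a block's state drives both
# fetching and the remaining-runtime estimate (sketch only).
from enum import Enum, auto

class BlockState(Enum):
    LOCAL_FETCHED = auto()          # (1) block is on the local node
    REMOTE_FETCH_WAITING = auto()   # (2) remote; fetch request not yet sent
    REMOTE_FETCHING = auto()        # (3) remote; fetch request in flight
    REMOTE_FETCHED = auto()         # (4) remote block already copied locally
    USED = auto()                   # (5) block has been consumed

print(len(BlockState))  # 5
```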
According to the first aspect, in a sixth possible implementation, the estimation of a task's remaining runtime is an important factor in deciding how to redistribute available data blocks. If the state of a data block is LocalFetched or RemoteFetched, its remaining runtime is its size divided by the computation speed of the data block on the local Executor. If the state of a data block is RemoteFetchWaiting or RemoteFetching, then, if the data block is on the local Executor, its remaining runtime is its size divided by its computation speed on the local Executor; otherwise, its remaining runtime is its size divided by its computation speed on the local Executor, plus its size divided by the download speed from the remote Executor to the local Executor. The remaining runtime of a data block in any other state is 0. From the metadata of the data blocks a task holds, the remaining runtime of the entire task can be derived.
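The per-block estimates above can be sketched directly; summing the block estimates to get the task's remaining runtime is one plausible reading of "can be derived", and all names and units (bytes, bytes per second) here are assumptions for illustration.

```python
# Sketch of the per-block remaining-runtime estimate described above.
def block_remaining_time(size_bytes: float, state: str, is_local: bool,
                         compute_speed: float, download_speed: float) -> float:
    if state in ("LocalFetched", "RemoteFetched"):
        return size_bytes / compute_speed
    if state in ("RemoteFetchWaiting", "RemoteFetching"):
        if is_local:
            return size_bytes / compute_speed
        # remote block: computation time plus time to download it locally
        return size_bytes / compute_speed + size_bytes / download_speed
    return 0.0                                  # e.g. state "Used"

def task_remaining_time(blocks) -> float:
    """Task remaining runtime: sum of its blocks' estimates (assumption)."""
    return sum(block_remaining_time(*b) for b in blocks)

blocks = [(100e6, "LocalFetched", True, 50e6, 100e6),
          (100e6, "RemoteFetching", False, 50e6, 100e6),
          (100e6, "Used", True, 50e6, 100e6)]
print(task_remaining_time(blocks))  # 2.0 + 3.0 + 0.0 = 5.0
```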
According to the first aspect, in a seventh possible implementation, the task with the longest remaining runtime is redistributed first. The redistribution algorithm takes as input the metadata tasks of all active tasks and a remaining-runtime predictor θ, where the predictor θ estimates the processing time of a data block on a specific Executor from statistical data. Suppose there are m active tasks, with metadata tasks = {task1, task2, ..., taskm}. The redistribution algorithm first arranges the active tasks in descending order of remaining runtime, taskf > taski > ... > taskt (f, i, ..., t ∈ [1, m]), then selects the task taskf with the longest remaining runtime, filters out all of its unprocessed data blocks, and saves the result in a list blocks; if there are n available data blocks, then blocks = {block1, block2, ..., blockn}. The algorithm traverses all available data blocks and passes each data block to the task taskt that completes first; it finally updates the hash table of task remaining runtimes, re-sorts the tasks list, and removes the scheduled data blocks from the blocks list. This process is repeated until one of the following two conditions is met:
(1) blocks = Φ;
(2) T(taskt) < max{T(taski) | i ∈ [1, m], i ≠ t}, where T(·) denotes the remaining runtime;
that is, until there are no available data blocks to be scheduled or the remaining runtime of the redistributed task is less than the maximum remaining runtime of the other tasks.
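The redistribution loop can be sketched as follows. The stopping condition is stated tersely in the text; this sketch interprets it as "stop when moving another block would push the receiving task's remaining time to or beyond the maximum remaining time of the other tasks". The predictor θ is stubbed as known per-block costs, and all data structures and names are simplifications invented for the example.

```python
# Runnable sketch of the redistribution loop of the seventh implementation.
# Each task is a dict with a name and a list of per-block costs (stub for θ).
def redistribute(tasks):
    """Move blocks from the longest-remaining task to the shortest until
    no blocks remain to move or further moves no longer help."""
    def remaining(t):
        return sum(t["blocks"])
    while True:
        tasks.sort(key=remaining, reverse=True)      # descending remaining time
        longest, shortest = tasks[0], tasks[-1]
        if not longest["blocks"]:                    # condition (1): blocks empty
            break
        blk = longest["blocks"].pop()                # move one unprocessed block
        shortest["blocks"].append(blk)               # to the earliest-finishing task
        others_max = max(remaining(t) for t in tasks if t is not shortest)
        if remaining(shortest) >= others_max:        # condition (2): no longer helps
            shortest["blocks"].remove(blk)           # undo the unhelpful move
            longest["blocks"].append(blk)
            break
    return tasks

ts = [{"name": "T2", "blocks": [3, 3, 3, 3]},   # skewed task, remaining 12
      {"name": "T3", "blocks": [2]},            # remaining 2
      {"name": "T4", "blocks": [4]}]            # remaining 4
redistribute(ts)
print({t["name"]: sum(t["blocks"]) for t in ts})
```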
In a second aspect, an embodiment of the present invention provides an apparatus for handling imbalanced data distribution on the Spark big data platform, comprising the scheduling apparatus described in the first aspect or in any possible implementation of the first aspect.
In a third aspect, an embodiment of the present invention provides a power-consumption reduction method for handling imbalanced data distribution on the Spark big data platform, characterized in that the Spark cluster system performs scheduling using the method described in the first aspect or in any possible implementation of the first aspect.
Brief description of the drawings
Fig. 1 is a system architecture diagram of a Spark cluster according to an embodiment of the present invention;
Fig. 2 is a flowchart of the MRFair system detecting skewed tasks according to an embodiment of the present invention;
Fig. 3 is a flowchart of the balanced data distribution method of the MRFair system according to an embodiment of the present invention;
Fig. 4 is an explanatory diagram of a sample according to an embodiment of the present invention.
Specific embodiment
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples are intended to illustrate the present invention, not to limit its scope.
As shown in Figure 3, an embodiment of the present invention provides a method for handling imbalanced data distribution on the Spark big data platform, the method comprising the steps of:
S101. setting some tasks in each Stage of Spark as active tasks;
S102. estimating the remaining runtime of each task according to the metadata of the active tasks;
S103. arranging the active tasks in descending order of remaining runtime, and reassigning the available data blocks of the task with the largest remaining runtime to the task with the smallest remaining runtime;
S104. updating the remaining runtimes of the tasks, and repeating step S103 until there are no available data blocks to be scheduled or the remaining runtime of the redistributed task is less than the maximum remaining runtime of the other tasks.
Those skilled in the art will understand that, in the methods of the various embodiments of the present invention, the magnitude of the serial number of each step does not imply an order of execution; the execution order of the steps should be determined by their function and internal logic, and does not constitute any limitation on the implementation of the specific embodiments of the present invention.
An embodiment of the present invention also provides a Spark cluster system that includes the scheduling apparatus of the embodiment shown in Fig. 3; the cluster system can be deployed according to the architecture shown in Fig. 1, and the scheduling apparatus may be the task scheduler shown in Fig. 1.
The various embodiments of the present invention are further illustrated below by a specific example:
The sample is detailed in Fig. 4, using the default Spark Standalone operating mode, in which the same stage of Spark has four tasks T1, T2, T3, T4, and the current cluster environment has four available resources Slot1, Slot2, Slot3, Slot4. Initially, the number of active tasks is set to 4. At time t1 (i.e., when Task T1 completes), the MRFair system estimates the remaining runtimes of the tasks and detects that Task T2 is the task with the longest remaining runtime and suffers from severe data skew; by repartitioning the unprocessed input data of T2, the influence of the skew is mitigated. In fact, MRFair not only repartitions the unprocessed data of T2 to Slot 1 and Slot 2 but also assigns it to Slot 3, which becomes an idle resource when Task T3 completes; MRFair again repartitions the unprocessed data associated with Task T2, thereby making the greatest possible use of resources, accelerating task execution, and reducing the total completion time. The redistributed tasks T2a, T2b, T2c are called the pre-split tasks of the original unprocessed task (pre-split tasks for short) and are scheduled by the longest-processing-time-first scheduling policy. MRFair cyclically executes the strategy of "estimate the remaining runtimes of the tasks, redistribute the unprocessed data blocks of the task with the longest remaining runtime" until all tasks have finished executing. At time t2, MRFair finds that T4 is the next task with the longest remaining runtime and reduces the influence of data skew by again repartitioning the remaining unprocessed data of task T4.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by instructing the relevant hardware through a computer program; the program may be stored in a computer-readable storage medium, and when executed may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The above description is merely of specific embodiments, but the scope of protection of the present invention is not limited thereto; any changes or substitutions that can readily be conceived by those familiar with the technical field, within the technical scope disclosed by the present invention, should be included within the scope of protection of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.