CN105718244A - Streamline data shuffle Spark task scheduling and executing method - Google Patents
Info
- Publication number
- CN105718244A CN105718244A CN201610029211.7A CN201610029211A CN105718244A CN 105718244 A CN105718244 A CN 105718244A CN 201610029211 A CN201610029211 A CN 201610029211A CN 105718244 A CN105718244 A CN 105718244A
- Authority
- CN
- China
- Prior art keywords
- task
- shuffling
- transmission
- node
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
- G06F9/3869—Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking
Abstract
The invention discloses a Spark task scheduling and execution method with pipelined shuffle data transmission. In this method, stages, and the tasks within them, are submitted and executed from back to front, while each predecessor task sends its execution results directly into the memory of its successor tasks. Without changing the user interface and without compromising the integrity or fault tolerance of stages, the method eliminates the disk-access overhead that the original Spark incurs when shuffling data between stages, thereby reducing the running time of distributed programs on Spark.
Description
Technical field
The present invention relates to the field of distributed computing frameworks. Specifically, building on the distributed computing framework Spark, it modifies Spark's task scheduling mechanism so as to improve the performance of the framework.
Background technology
Spark, currently the most widely used distributed computing framework, has been deployed in countless data centers. The Resilient Distributed Dataset (RDD) it introduces allows big-data computation to stay in memory as much as possible. Logically, Spark generates RDDs from front to back following the logic of the user program, and each RDD carries its own dependencies. When the user program needs to produce a final result, Spark searches backwards recursively from the last RDD and divides the computation into stages (Stage) at the shuffle dependencies (ShuffleDependency) it encounters. Once the stages are divided, Spark submits them from front to back, starting with the stages that have no unsatisfied dependencies. This scheduling logic lets data flow automatically to where it is needed and keeps intermediate results in memory as much as possible.
However, to preserve the separation between stages (Stage) and the fault tolerance of the framework itself, at every shuffle dependency (ShuffleDependency) that divides two stages Spark writes the intermediate results produced by the predecessor stage to disk, only then starts dispatching the tasks of the next stage, and the tasks of the later stage remotely read the data back from disk before computing.
Given that current disks are far slower than memory, this portion of data reading and writing has become the largest bottleneck limiting Spark's performance. Yet, to avoid breaking the integrity of stage division and fault tolerance, no patch or solution optimizing this bottleneck has appeared so far.
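The stage division described above can be illustrated with a toy sketch: walking backwards from the final RDD and cutting a new stage at every shuffle dependency. This is a simplification for illustration, not Spark's actual DAGScheduler code; the class and function names are assumptions of this sketch.

```python
# Toy sketch of backward stage division at shuffle dependencies.
# Not Spark source code; names are illustrative only.
class RDD:
    def __init__(self, name, deps=()):
        self.name = name
        self.deps = deps  # list of (parent_rdd, is_shuffle) pairs

def divide_stages(final_rdd):
    """Walk back from the final RDD; start a new stage at each
    shuffle dependency. Returns stages as lists of RDD names,
    final stage first."""
    stages = []

    def visit(rdd, stage):
        stage.append(rdd.name)
        for parent, is_shuffle in rdd.deps:
            if is_shuffle:
                new_stage = []        # shuffle boundary: cut a stage
                stages.append(new_stage)
                visit(parent, new_stage)
            else:
                visit(parent, stage)  # narrow dependency: same stage

    last = []
    stages.append(last)
    visit(final_rdd, last)
    return stages
```

For a word-count-shaped lineage (`lines -> pairs -> counts`, with a shuffle before `counts`), this yields two stages: one containing `counts`, one containing `pairs` and `lines`.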
Summary of the invention
Targeting the disk read/write bottleneck of Spark's shuffle data transmission, the present invention proposes a Spark task scheduling method with pipelined shuffle transmission (Shuffle). By changing the order in which Spark submits tasks, a task is scheduled and dispatched before its predecessor tasks start; meanwhile, each predecessor task sends its execution results directly into the memory of its successor tasks. Without changing the user interface and without compromising stage (Stage) integrity or fault tolerance, this eliminates the disk read/write overhead of the original Spark's shuffle data transmission (Shuffle) between stages. As a predecessor task produces intermediate results during execution, it sends them over the network to its successor tasks, thereby avoiding disk I/O and improving the performance of the Spark distributed computing framework.
The technical solution of the present invention is as follows:
A Spark task scheduling and execution method with pipelined shuffle transmission comprises the following steps:
Step 1: When Spark submits a job that is divided into multiple stages, first find the final stage that produces the result for the user;
Step 2: Starting from the final stage, judge whether the stage has unfinished predecessor stages:
If all predecessor stages of this stage have finished executing, submit the stage for execution;
If some predecessor stage has not yet executed, mark this stage as waiting, submit it for execution anyway, and recursively submit its predecessor stages;
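The inverted submission order of Steps 1 and 2 can be sketched as a short recursion: a stage with unfinished parents is marked "waiting" and submitted anyway, before its parents. This is an illustrative model with assumed names, not the patented implementation itself.

```python
# Sketch of Steps 1-2: back-to-front stage submission.
# A stage is submitted even when its predecessors are unfinished,
# but marked "waiting" so the scheduler can set up pipelining.
def submit_stages(stage, finished, submitted=None, waiting=None):
    """stage: dict with 'id' and 'parents' (list of stage dicts).
    finished: set of stage ids already executed.
    Returns (submission order, set of stage ids marked waiting)."""
    if submitted is None:
        submitted, waiting = [], set()
    unfinished = [p for p in stage["parents"] if p["id"] not in finished]
    if unfinished:
        waiting.add(stage["id"])       # mark as waiting...
        submitted.append(stage["id"])  # ...but submit it anyway
        for p in unfinished:           # then recursively submit parents
            submit_stages(p, finished, submitted, waiting)
    else:
        submitted.append(stage["id"])
    return submitted, waiting
```

Note the contrast with stock Spark, which would submit stage 0 first and stage 1 only after stage 0 completed; here the child stage enters the scheduler first, in the waiting state.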
Step 3: After a stage has been submitted for execution, the scheduler splits it into multiple tasks and judges whether the stage is a waiting stage:
If the stage is marked as waiting, the scheduler requests from the resource manager as many idle execution nodes as there are tasks. After obtaining the execution nodes for the tasks, the scheduler recursively traverses the dependencies of the Resilient Distributed Datasets contained in the stage to find the shuffle dependencies; for every shuffle dependency found, it registers the pipelining information of that shuffle with the map output tracker. After registration is complete, the scheduler also notifies each execution node that will run these tasks to prepare memory for caching the intermediate results sent by its predecessor tasks. On receiving the scheduler's registration message, each execution node creates in its local cache a new entry indexed by the shuffle dependency ID whose value is an array of buffers, one per reduce partition; it likewise creates locally an entry indexed by the shuffle dependency ID whose value is an array of semaphores, one per reduce partition, where each semaphore is initialized to the total number of map tasks of the shuffle it belongs to;
Otherwise, proceed directly to the next step;
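The per-node state created in Step 3 can be sketched as follows. The class and method names are assumptions of this sketch; the patent only specifies the shape of the state: one buffer and one CountDownLatch-style semaphore per reduce partition, both keyed by shuffle dependency ID.

```python
import threading

class CountDownLatch:
    """Rough Python analogue of Java's CountDownLatch (the patent
    names CountDownLatch semantics; this emulation is an assumption)."""
    def __init__(self, count):
        self._count = count
        self._cond = threading.Condition()

    def count_down(self):
        with self._cond:
            self._count -= 1
            if self._count <= 0:
                self._cond.notify_all()

    def await_zero(self, timeout=None):
        with self._cond:
            return self._cond.wait_for(lambda: self._count <= 0, timeout)

class NodeShuffleState:
    """Sketch of an execution node's local state from Step 3."""
    def __init__(self):
        self.buffers = {}  # shuffle_dep_id -> [record list per reduce partition]
        self.latches = {}  # shuffle_dep_id -> [CountDownLatch per reduce partition]

    def register(self, shuffle_dep_id, num_reduce_partitions, num_map_tasks):
        # one buffer per reduce partition this node may serve
        self.buffers[shuffle_dep_id] = [[] for _ in range(num_reduce_partitions)]
        # one latch per reduce partition, initialized to the map-task count
        self.latches[shuffle_dep_id] = [
            CountDownLatch(num_map_tasks) for _ in range(num_reduce_partitions)
        ]
```

A latch reaches zero only after every map task of the shuffle has announced completion for that partition, which is exactly the readiness condition Steps 10 and 14 rely on.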
Step 4: When the scheduler packages the task set of the stage, judge whether the stage is a shuffle-map stage:
If the stage is a shuffle-map stage, attach the corresponding shuffle dependency ID to every task of the stage;
Otherwise, proceed directly to the next step;
Step 5: The scheduler dispatches the packaged tasks to the execution nodes;
Step 6: When a task is dispatched to the executor of an execution node, the executor judges whether the task is a shuffle-map task:
If so, then according to the shuffle dependency ID carried by the task, it requests from the map output tracker the aggregate information of the execution nodes of the reduce tasks for this ID; it then sets the reduce-side information of the shuffle-map task, packs the received reduce partition numbers and remote addresses into a hash table passed to the shuffle-map task, and enters step 7;
If the executor judges the task to be a reduce task, it calls the function of the reduce task to compute, and enters step 11;
Step 7: When a shuffle-map task starts executing, it checks whether pipelined data output is required:
If required, the partitioner specified by the user (or Spark's default partitioner) computes from the key of each intermediate key-value pair the reduce partition number it belongs to; according to the hash table of partition numbers and remote addresses, the computed results are sent to the execution nodes responsible for the corresponding successor reduce tasks. Each message sent contains: the shuffle dependency ID, the reduce partition number, and the key-value pairs of the result. While sending the data, the executor also writes it to disk, and enters step 8. Meanwhile, on receiving the pipelined data, the execution node responsible for the reduce task stores it, with the shuffle dependency ID as index, into the buffer of the given reduce partition in the buffer array for that ID, and enters step 8;
If not required, the execution results are written directly to disk; enter step 8;
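The pipelined output path of Step 7 can be sketched as follows: each intermediate key-value pair is routed by a partitioner to its reduce partition and pushed to the remote node immediately, while a copy still goes to disk for fault tolerance. The `send` and `disk` parameters are stand-ins assumed for this sketch, not Spark APIs.

```python
# Sketch of Step 7: pipelined shuffle-map output.
def hash_partitioner(key, num_partitions):
    """Stand-in for Spark's default HashPartitioner."""
    return hash(key) % num_partitions

def pipeline_output(records, shuffle_dep_id, num_partitions, send, disk):
    """records: iterable of (key, value) intermediate results.
    send(dep_id, partition, kv): pushes one pair to the reducer's node.
    disk: list standing in for the local shuffle file."""
    for key, value in records:
        p = hash_partitioner(key, num_partitions)
        # network push: the message carries the shuffle dependency ID,
        # the reduce partition number, and the key-value pair (Step 7)
        send(shuffle_dep_id, p, (key, value))
        # the on-disk copy is kept in parallel, for fault tolerance
        disk.append((p, key, value))
```

The disk write is what allows the fault-tolerance path described later: a failed reduce task can be resubmitted and fall back to reading its predecessor's disk output.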
Step 8: The executor completes the shuffle-map task;
Step 9: When a shuffle-map task finishes running, a pipeline-end message is sent to all execution nodes responsible for reduce tasks; the message contains: the shuffle dependency ID, the map partition number this task was responsible for, and the reduce partition numbers the receiving execution node is responsible for;
Step 10: After the execution node responsible for a reduce task receives the pipeline-end message, it uses the shuffle dependency ID as an index to find the semaphore array for that ID and counts down by one the CountDownLatch of the given reduce partition. When the semaphore reaches 0, all the map data blocks that the reduce partition depends on have finished transmitting;
Step 11: When the executor executes the specified function of a reduce task, it calls the corresponding reduce function, which, on reading its input data, requests from the execution node an iterator over the data;
Step 12: When creating the iterator, the execution node is queried whether this shuffle has a local cache, i.e. whether the data could be delivered by pipelined transmission:
If so, call the get-cache method of the execution node, requesting the cache according to the shuffle dependency ID of the reduce task and the reduce partition number the task is responsible for, and enter step 13;
Otherwise, read the data remotely and enter step 15;
Step 13: On receiving the get-cache call, the execution node uses the shuffle dependency ID as an index to find the corresponding buffer array in its cache and returns an asynchronous reference to the buffer of the requested reduce partition;
Step 14: On receiving the asynchronous reference to the buffer, the iterator waits until the CountDownLatch of the requested reduce partition of this shuffle dependency reaches 0, indicating that all the map data blocks the reduce partition depends on have arrived completely; enter step 15;
Step 15: The executor executes the specified reduce function.
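The reduce-side flow of Steps 9 through 15 can be sketched end to end: each pipeline-end message counts down the partition's latch, and the reducer's input request blocks until the latch reaches zero, then consumes the local cache instead of touching disk. Function names here are assumptions of the sketch.

```python
import threading

class CountDownLatch:
    """Rough analogue of Java's CountDownLatch for this sketch."""
    def __init__(self, count):
        self._count = count
        self._cond = threading.Condition()

    def count_down(self):
        with self._cond:
            self._count -= 1
            if self._count <= 0:
                self._cond.notify_all()

    def await_zero(self):
        with self._cond:
            self._cond.wait_for(lambda: self._count <= 0)

def handle_pipeline_end(latches, shuffle_dep_id, reduce_partition):
    # Step 10: a map task finished; count down that partition's latch
    latches[shuffle_dep_id][reduce_partition].count_down()

def reduce_input(buffers, latches, shuffle_dep_id, reduce_partition):
    # Steps 13-14: block until every map block has arrived...
    latches[shuffle_dep_id][reduce_partition].await_zero()
    # ...then (Step 15) hand the pipelined records to the reduce function
    return buffers[shuffle_dep_id][reduce_partition]
```

With two map tasks, the reduce side becomes runnable only after both pipeline-end messages arrive, matching the latch-reaches-zero condition of Step 14.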
Compared with the prior art, the beneficial effect of the invention is that it significantly shortens the shuffle data transmission time during Spark task execution, and thereby shortens the execution time of distributed computing jobs.
Brief description of the drawings
Fig. 1. Architecture diagram
Fig. 2. Scheduler scheduling flowchart
Fig. 3. Execution node flowchart
Fig. 4. Pipeline registration message
Fig. 5. Pipeline notification message
Fig. 6. Pipeline transmission message
Fig. 7. Pipeline end message
Fig. 8. Execution node cache architecture
Fig. 9. Execution node semaphore control architecture
Detailed description of the embodiments
Embodiments of the invention are described in detail below with reference to the accompanying drawings. The embodiment is carried out on the premise of the technical solution and algorithms of the invention, and gives a detailed implementation and concrete operating procedure, but the applicable platforms are not limited to the following embodiment: any cluster compatible with master-branch Spark can run the invention. The concrete platform of this example is a small cluster of two ordinary servers, each running Ubuntu Server 14.04.1 LTS 64-bit and equipped with 8 GB of memory. The implementation of the invention is based on the source code of Apache Spark 1.5.
As shown in Fig. 1, the invention builds on Spark's original architecture, comprising the scheduler (DAGScheduler), the resource manager (BlockManagerMaster), the map output tracker (MapOutputTracker), the execution nodes (BlockManager) and the executors (Executor). By changing the task scheduling algorithm and execution flow, it realizes pipelined shuffle data transmission without compromising stage integrity or fault tolerance. The scheduler and the execution nodes operate according to the flows of Fig. 2 and Fig. 3 respectively, yielding the improvement in distributed computing performance.
The Spark distribution containing the invention is deployed on every server; one server acts as the Master of the Spark cluster and the other as a Slave. Note that, to guarantee the performance of the invention, the deployed cluster should be configured with more memory than an original Spark cluster; the concrete amount depends on the data volume of the jobs.
Once deployed, distributed applications can be run exactly as on ordinary Spark; the change is fully transparent to users of the Spark framework.
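Transparency means an unmodified program with the usual map-shuffle-reduce shape, such as word count, runs as-is and simply benefits from the pipelined shuffle. The sketch below uses a pure-Python stand-in for the Spark job so the shape is checkable without a cluster; it is illustrative only.

```python
# Pure-Python stand-in for a Spark word count: the same
# map -> shuffle -> reduceByKey shape a Spark job would have,
# which the invention accelerates without any code change.
def word_count(lines):
    # map phase: emit (word, 1) pairs
    pairs = [(w, 1) for line in lines for w in line.split()]
    # shuffle + reduce phase: group by key and sum
    counts = {}
    for w, n in pairs:
        counts[w] = counts.get(w, 0) + n
    return counts
```

In the actual embodiment the same program, expressed with Spark RDD operations, runs unchanged; only the shuffle between the map and reduce stages is pipelined.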
The Spark task scheduling and execution method with pipelined shuffle transmission comprises Steps 1 through 15 exactly as set forth above.
When performing above step, one step of any of which is made mistakes and all can be triggered fault-tolerance mechanism: if mistake betides the predecessor task of the transmission of shuffling of a streamlined, namely shuffle the step of mapping tasks, and the either step before this step, so his follow-up all can be marked as failure and resubmit, and continues executing with the transmission of shuffling of streamlined;If mistake betides the subsequent tasks of the transmission of shuffling of a streamlined, the i.e. execution step of stipulations task, so predecessor task can't be affected, and the subsequent tasks of failure can be resubmited, the subsequent tasks now resubmited will go to read required data from the disk of predecessor task.
Because in this embodiment the nodes awaiting reduce tasks receive data while the shuffle-map tasks are still executing, the waiting time of the original Spark shuffle transmission is hidden. When the shuffle-map tasks finish, the reduce tasks can start much sooner, accelerating the whole distributed computation.
Spark benchmark programs such as WordCount were run on this embodiment, verifying the correctness of the invention; across the different benchmarks the invention shows varying degrees of performance improvement over master-branch Spark.
The preferred embodiment of the invention has been described in detail above. It should be understood that those of ordinary skill in the art can make many modifications and variations according to the concept of the invention without creative work. Therefore, any technical scheme that a person skilled in the art can obtain from the prior art through logical analysis, reasoning or limited experiments under the concept of the invention shall fall within the scope of protection defined by the claims.
Claims (6)
1. streamlined data are shuffled the Spark task scheduling of transmission and execution method, it is characterised in that comprise the steps:
Step 1: when Spark submit to a task and this task be divided into multiple stage submit to time, first find user to perform task and generate the last stage of result;
Step 2: from the last stage, it is judged that whether this stage comprises the prodromic phase being not fully complete:
If the prodromic phase in this stage has all performed, then this stage is submitted to perform;
If there being prodromic phase not to be performed, then by this phased markers for waiting, submit to this stage to perform simultaneously, and recurrence submits the prodromic phase in this stage to;
Step 3: after submitting to a stage to perform, whether this stage resolutions is become multiple task by scheduler, and judges this stage: be loitering phase:
If this stage is marked as wait, then scheduler performs node to the free time that explorer request is identical with task number, after scheduler obtains the corresponding execution node performing task, the dependence recurrence of the distributed rebound data collection comprised according to this stage is found forward transmission of shuffling and is relied on, scheduler often find one shuffle transmission rely on will to mapping output trace table register this time shuffle transmission flowing water information, after registration is complete, scheduler is also notified that each execution node that namely will run this task gets out corresponding internal memory and carrys out the intermediate object program that their predecessor task of buffer memory sends;After each execution node receives the log-on message of scheduler, can in local cache newly-built one with shuffle transmission rely on ID for index, value is the key-value pair of the always several buffer memory array of conventions data block, also can rely on ID for index at local newly-built one with transmission of shuffling simultaneously, value is the key-value pair of the semaphore data structure of conventions data block sum, wherein each semaphore comprises the duty mapping sum of shuffling relied on of shuffling specifically;
Otherwise, it is directly entered next step;
Step 4: when the set of tasks of scheduler encapsulated phase, it is judged that whether this stage is a mapping phase of shuffling:
If this stage is a mapping phase of shuffling, then the transmission of shuffling that each task in this stage arranges correspondence relies on ID;
Otherwise, it is directly entered next step;
Step 5: packaged task is distributed to each and performs node by scheduler;
Step 6: when performing on the executor that task is assigned to each execution node, executor can judge that this task is whether for mapping tasks of shuffling:
If, the transmission of shuffling then comprised according to this task relies on ID, to mapping the aggregate information that output trace table asks the execution node of stipulations task corresponding for this ID, then, the corresponding stipulations information of mapping tasks of shuffling is set, conventions data block number in the combining information received and remote address are packaged into a Hash table and pass to this mapping tasks of shuffling, and enter step 7;
If executor judges that this task is stipulations task, the function calling this stipulations task corresponding is calculated, and enters step 11;
Step 7: when a mapping tasks of shuffling starts to perform, can check the need for the output of streamlined data;
The need to, intermediate object program key-value pair is calculated the conventions data block number of his correspondence by the grader of the grader first specified according to user or Spark acquiescence according to key, according to the data block number arranged and remote address Hash table, the execution node that the data result obtained is sent to the responsible follow-up stipulations task of correspondence will be calculated, the information sent includes: transmission of shuffling relies on ID, conventions data block number, the key-value pair of data result;While sending data, executor is to disk write data, and enters step 8;Meanwhile, after the execution node of responsible stipulations task receives pipelined data, can, using the data dependence ID that shuffles as index, preserve in the buffer memory of conventions data block number of buffer memory array corresponding for this ID, enter step 8;
If it is not required, then result write disk directly will be performed, enter step 8;
Step 8: executor completes mapping tasks of shuffling;
Step 9: when one shuffle mapping tasks end of run time, flowing water ending message will be sent to the execution node of all of responsible stipulations task, this information includes: transmission of shuffling relies on ID, the mapping data block number that this task is responsible for, corresponding with this information performs the conventions data block number that node is responsible for;
Step 10: after the execution node of responsible stipulations task receives the information that flowing water terminates, can rely on ID as index according to transmission of shuffling, find semaphore array corresponding for this ID, subtract one by wherein conventions data block number CountDownLatch.If this semaphore is reduced to 0, then it represents that the data that this conventions data block relies on map whole ends of transmission;
Step 11: when executor performs the specified function of stipulations task, can call corresponding stipulations function, and this function, when the task of execution reads data, can ask an iterator reading in data to performing node;
Step 12: to performing node inquiry shuffles to transmit whether have local cache specifically, namely whether can be transmitted by streamlined data when generating iterator:
If it is, call the acquisition caching method performing node, rely on conventions data block number responsible with this task for ID to performing node request buffer memory according to the transmission of shuffling of stipulations task, and enter step 13;
Otherwise, read teledata, enter step 15;
Step 13: perform after node receives and obtain the calling of buffer memory, to rely on ID for index with transmission of shuffling, finds buffer memory array corresponding in buffer memory, and returns conventions data block number the asynchronous of buffer memory and quote;
Step 14: iterator receive buffer memory asynchronous quote time, start waiting for, until the CountDownLatch semaphore in the conventions data block number of the transmission dependence of shuffling of this required by task becomes 0, represent that the mapping data block that this conventions data block relies on is fully completed, enter step 15;
Step 15: the executor runs the specified reduce function.
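Steps 11 to 15 describe the reduce-side read path: prefer the locally cached (pipelined) blocks, and fall back to a remote fetch otherwise. A hedged Java sketch of that branching (the `ExecNode` interface and all names are hypothetical simplifications; in the patent the cache reference is asynchronous and consumption is gated by the CountDownLatch of step 14):

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical reduce-side reader illustrating the step 12 branch.
class ReduceReader {
    // Simplified view of the execution node's shuffle services.
    interface ExecNode {
        boolean hasLocalCache(int shuffleDepId);                     // step 12 query
        List<String> getCache(int shuffleDepId, int reduceBlock);    // step 13 (async in the patent)
        List<String> fetchRemote(int shuffleDepId, int reduceBlock); // non-pipelined fallback
    }

    static Iterator<String> openIterator(ExecNode node, int shuffleDepId, int reduceBlock) {
        if (node.hasLocalCache(shuffleDepId)) {
            // Steps 13-14: request the cached reduce block; the real
            // implementation would first await the block's CountDownLatch.
            return node.getCache(shuffleDepId, reduceBlock).iterator();
        }
        // Step 12 "otherwise": read the data remotely, then run the reduce function.
        return node.fetchRemote(shuffleDepId, reduceBlock).iterator();
    }
}
```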
2. The Spark task scheduling and execution method for pipelined data shuffle transmission according to claim 1, characterized in that an error in any of steps 1 to 15 triggers the fault-tolerance mechanism: if the error occurs in a predecessor task of a pipelined shuffle transmission, i.e., in a shuffle map task step or any step before it, then all of its successors are marked as failed and resubmitted, and the pipelined shuffle transmission continues; if the error occurs in a successor task of a pipelined shuffle transmission, i.e., in an execution step of a reduce task, then the predecessor tasks are unaffected and only the failed successor tasks are resubmitted, and the resubmitted successor tasks read the data they need from the predecessor tasks' disks.
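The two failure cases of claim 2 reduce to a small dispatch: map-side failures restart the pipelined shuffle with all successors, while reduce-side failures resubmit only the reduce tasks, which then fall back to disk. A sketch under those assumptions (names and return values are illustrative, not from the patent; real resubmission is done by the scheduler):

```java
// Hypothetical classification of where a failure occurred in steps 1-15.
enum FailurePhase { MAP_SIDE, REDUCE_SIDE }

class FaultHandler {
    // Returns an illustrative recovery action for each failure phase.
    static String handle(FailurePhase phase) {
        if (phase == FailurePhase.MAP_SIDE) {
            // Mark every successor failed and resubmit; the pipelined
            // shuffle transmission then continues from the restarted maps.
            return "resubmit-map-and-successors";
        }
        // Predecessors are untouched; resubmitted reduce tasks read the
        // map outputs from the predecessors' disks instead of the pipeline.
        return "resubmit-reduce-read-from-disk";
    }
}
```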
3. The Spark task scheduling and execution method for pipelined data shuffle transmission according to claim 1, characterized in that in step 3 the scheduler requests from the resource manager a number of idle execution nodes equal to the number of tasks, obtaining the idle execution nodes with a random strategy.
4. The Spark task scheduling and execution method for pipelined data shuffle transmission according to claim 1, characterized in that the shuffle transmission pipeline information in step 3 includes the shuffle transmission dependency ID and the information of the corresponding execution node set.
5. The Spark task scheduling and execution method for pipelined data shuffle transmission according to claim 4, characterized in that the information of the execution node set comprises the reduce data block numbers each execution node is responsible for and its meta-information, the meta-information including the node name and address.
6. The Spark task scheduling and execution method for pipelined data shuffle transmission according to claim 1, characterized in that in step 3, after registration completes, the scheduler also notifies each execution node about to run the task to prepare memory for caching the intermediate results sent by its predecessor tasks; this notification contains: the ID of the shuffle transmission, the total number of partitions of the predecessor shuffle map tasks, the reduce data block numbers this execution node is responsible for in this task, and the total number of reduce data blocks.
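The "prepare memory" notification of claim 6 is essentially a small message with four fields. A possible shape as a Java record (field names are illustrative assumptions; the sizing rule shown is one plausible way to use the fields, not stated in the patent):

```java
import java.util.List;

// Hypothetical form of the claim-6 notification sent by the scheduler
// to each execution node before the successor stage starts.
record PrepareCacheMsg(
        int shuffleTransmissionId,    // ID of this shuffle transmission
        int mapPartitionCount,        // total partitions of the predecessor shuffle map tasks
        List<Integer> myReduceBlocks, // reduce data block numbers this node is responsible for
        int totalReduceBlocks) {      // total number of reduce data blocks

    // A plausible sizing rule: one buffer slot per (local reduce block,
    // predecessor map partition) pair for incoming intermediate results.
    int bufferSlots() {
        return myReduceBlocks.size() * mapPartitionCount;
    }
}
```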
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610029211.7A CN105718244B (en) | 2016-01-18 | 2016-01-18 | Spark task scheduling and execution method for pipelined data shuffle transmission
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610029211.7A CN105718244B (en) | 2016-01-18 | 2016-01-18 | Spark task scheduling and execution method for pipelined data shuffle transmission
Publications (2)
Publication Number | Publication Date |
---|---|
CN105718244A true CN105718244A (en) | 2016-06-29 |
CN105718244B CN105718244B (en) | 2018-01-12 |
Family
ID=56147869
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610029211.7A Active CN105718244B (en) | 2016-01-18 | 2016-01-18 | Spark task scheduling and execution method for pipelined data shuffle transmission
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105718244B (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106168963A (en) * | 2016-06-30 | 2016-11-30 | 北京金山安全软件有限公司 | Real-time streaming data processing method and device and server |
CN106371919A (en) * | 2016-08-24 | 2017-02-01 | 上海交通大学 | Shuffle data caching method based on mapping-reduction calculation model |
CN107612886A (en) * | 2017-08-15 | 2018-01-19 | 中国科学院大学 | A kind of Spark platforms Shuffle process compresses algorithm decision-making techniques |
CN107885587A (en) * | 2017-11-17 | 2018-04-06 | 清华大学 | A kind of executive plan generation method of big data analysis process |
CN109951556A (en) * | 2019-03-27 | 2019-06-28 | 联想(北京)有限公司 | A kind of Spark task processing method and system |
CN110083441A (en) * | 2018-01-26 | 2019-08-02 | 中兴飞流信息科技有限公司 | A kind of distributed computing system and distributed computing method |
CN110109747A (en) * | 2019-05-21 | 2019-08-09 | 北京百度网讯科技有限公司 | Method for interchanging data and system, server based on Apache Spark |
CN110134714A (en) * | 2019-05-22 | 2019-08-16 | 东北大学 | A kind of distributed computing framework caching index suitable for big data iterative calculation |
CN110750341A (en) * | 2018-07-24 | 2020-02-04 | 深圳市优必选科技有限公司 | Task scheduling method, device, system, terminal equipment and storage medium |
CN111061565A (en) * | 2019-12-12 | 2020-04-24 | 湖南大学 | Two-stage pipeline task scheduling method and system in Spark environment |
CN111258785A (en) * | 2020-01-20 | 2020-06-09 | 北京百度网讯科技有限公司 | Data shuffling method and device |
CN111782367A (en) * | 2020-06-30 | 2020-10-16 | 北京百度网讯科技有限公司 | Distributed storage method and device, electronic equipment and computer readable medium |
CN112269648A (en) * | 2020-11-13 | 2021-01-26 | 北京轩宇信息技术有限公司 | Parallel task allocation method and device for multi-stage program analysis |
CN113364603A (en) * | 2020-03-06 | 2021-09-07 | 华为技术有限公司 | Fault recovery method of ring network and physical node |
CN113495679A (en) * | 2020-04-01 | 2021-10-12 | 孟彤 | Optimization method for large data storage access and processing based on nonvolatile storage medium |
CN114448660A (en) * | 2021-12-16 | 2022-05-06 | 国网江苏省电力有限公司电力科学研究院 | Internet of things data access method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130346988A1 (en) * | 2012-06-22 | 2013-12-26 | Microsoft Corporation | Parallel data computing optimization |
CN103605576A (en) * | 2013-11-25 | 2014-02-26 | 华中科技大学 | Multithreading-based MapReduce execution system |
US20140358977A1 (en) * | 2013-06-03 | 2014-12-04 | Zettaset, Inc. | Management of Intermediate Data Spills during the Shuffle Phase of a Map-Reduce Job |
CN104750482A (en) * | 2015-03-13 | 2015-07-01 | 合一信息技术(北京)有限公司 | Method for constructing dynamic script execution engine based on MapReduce |
2016
- 2016-01-18 CN CN201610029211.7A patent/CN105718244B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130346988A1 (en) * | 2012-06-22 | 2013-12-26 | Microsoft Corporation | Parallel data computing optimization |
US20140358977A1 (en) * | 2013-06-03 | 2014-12-04 | Zettaset, Inc. | Management of Intermediate Data Spills during the Shuffle Phase of a Map-Reduce Job |
CN103605576A (en) * | 2013-11-25 | 2014-02-26 | 华中科技大学 | Multithreading-based MapReduce execution system |
CN104750482A (en) * | 2015-03-13 | 2015-07-01 | 合一信息技术(北京)有限公司 | Method for constructing dynamic script execution engine based on MapReduce |
Non-Patent Citations (2)
Title |
---|
WEIKUAN YU et al.: "Virtual shuffling for efficient data movement in MapReduce", IEEE Transactions on Computers *
XUANHUA SHI et al.: "Mammoth: Gearing Hadoop towards memory-intensive MapReduce applications", IEEE Transactions on Parallel and Distributed Systems *
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106168963B (en) * | 2016-06-30 | 2019-06-11 | 北京金山安全软件有限公司 | Real-time streaming data processing method and device and server |
CN106168963A (en) * | 2016-06-30 | 2016-11-30 | 北京金山安全软件有限公司 | Real-time streaming data processing method and device and server |
CN106371919A (en) * | 2016-08-24 | 2017-02-01 | 上海交通大学 | Shuffle data caching method based on mapping-reduction calculation model |
CN106371919B (en) * | 2016-08-24 | 2019-07-16 | 上海交通大学 | It is a kind of based on mapping-reduction computation model data cache method of shuffling |
CN107612886B (en) * | 2017-08-15 | 2020-06-30 | 中国科学院大学 | Spark platform Shuffle process compression algorithm decision method |
CN107612886A (en) * | 2017-08-15 | 2018-01-19 | 中国科学院大学 | A kind of Spark platforms Shuffle process compresses algorithm decision-making techniques |
CN107885587A (en) * | 2017-11-17 | 2018-04-06 | 清华大学 | A kind of executive plan generation method of big data analysis process |
CN107885587B (en) * | 2017-11-17 | 2018-12-07 | 清华大学 | A kind of executive plan generation method of big data analysis process |
CN110083441A (en) * | 2018-01-26 | 2019-08-02 | 中兴飞流信息科技有限公司 | A kind of distributed computing system and distributed computing method |
CN110083441B (en) * | 2018-01-26 | 2021-06-04 | 中兴飞流信息科技有限公司 | Distributed computing system and distributed computing method |
CN110750341A (en) * | 2018-07-24 | 2020-02-04 | 深圳市优必选科技有限公司 | Task scheduling method, device, system, terminal equipment and storage medium |
CN109951556A (en) * | 2019-03-27 | 2019-06-28 | 联想(北京)有限公司 | A kind of Spark task processing method and system |
CN110109747A (en) * | 2019-05-21 | 2019-08-09 | 北京百度网讯科技有限公司 | Method for interchanging data and system, server based on Apache Spark |
CN110109747B (en) * | 2019-05-21 | 2021-05-14 | 北京百度网讯科技有限公司 | Apache Spark-based data exchange method, system and server |
CN110134714B (en) * | 2019-05-22 | 2021-04-20 | 东北大学 | Distributed computing framework cache index method suitable for big data iterative computation |
CN110134714A (en) * | 2019-05-22 | 2019-08-16 | 东北大学 | A kind of distributed computing framework caching index suitable for big data iterative calculation |
CN111061565B (en) * | 2019-12-12 | 2023-08-25 | 湖南大学 | Two-section pipeline task scheduling method and system in Spark environment |
CN111061565A (en) * | 2019-12-12 | 2020-04-24 | 湖南大学 | Two-stage pipeline task scheduling method and system in Spark environment |
CN111258785A (en) * | 2020-01-20 | 2020-06-09 | 北京百度网讯科技有限公司 | Data shuffling method and device |
CN113364603B (en) * | 2020-03-06 | 2023-05-02 | 华为技术有限公司 | Fault recovery method of ring network and physical node |
CN113364603A (en) * | 2020-03-06 | 2021-09-07 | 华为技术有限公司 | Fault recovery method of ring network and physical node |
CN113495679A (en) * | 2020-04-01 | 2021-10-12 | 孟彤 | Optimization method for large data storage access and processing based on nonvolatile storage medium |
CN113495679B (en) * | 2020-04-01 | 2022-10-21 | 北京大学 | Optimization method for large data storage access and processing based on nonvolatile storage medium |
CN111782367A (en) * | 2020-06-30 | 2020-10-16 | 北京百度网讯科技有限公司 | Distributed storage method and device, electronic equipment and computer readable medium |
CN111782367B (en) * | 2020-06-30 | 2023-08-08 | 北京百度网讯科技有限公司 | Distributed storage method and device, electronic equipment and computer readable medium |
CN112269648A (en) * | 2020-11-13 | 2021-01-26 | 北京轩宇信息技术有限公司 | Parallel task allocation method and device for multi-stage program analysis |
CN112269648B (en) * | 2020-11-13 | 2024-05-31 | 北京轩宇信息技术有限公司 | Parallel task allocation method and device for multi-stage program analysis |
CN114448660A (en) * | 2021-12-16 | 2022-05-06 | 国网江苏省电力有限公司电力科学研究院 | Internet of things data access method |
CN114448660B (en) * | 2021-12-16 | 2024-06-04 | 国网江苏省电力有限公司电力科学研究院 | Internet of things data access method |
Also Published As
Publication number | Publication date |
---|---|
CN105718244B (en) | 2018-01-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105718244A (en) | Streamline data shuffle Spark task scheduling and executing method | |
US11694066B2 (en) | Machine learning runtime library for neural network acceleration | |
Almeida et al. | ChainReaction: a causal+ consistent datastore based on chain replication | |
CN111309649B (en) | Data transmission and task processing method, device and equipment | |
CN109690510A (en) | Multicast device and method for multiple receivers by data distribution into high-performance calculation network and network based on cloud | |
US20230153158A1 (en) | Method, apparatus, system, and storage medium for performing eda task | |
US9418181B2 (en) | Simulated input/output devices | |
CN111104224B (en) | FPGA-based dynamic graph processing method | |
US20170085653A1 (en) | Method, device and system for message distribution | |
CN110457033A (en) | Device and method for generating dynamic trace data on GPU | |
CN105022288A (en) | Simulation system of industrial electronic embedded system | |
US11275661B1 (en) | Test generation of a distributed system | |
CN117312215B (en) | Server system, job execution method, device, equipment and medium | |
EP2672388B1 (en) | Multi-processor parallel simulation method, system and scheduler | |
CN111897826A (en) | Parameter information updating method and device, electronic equipment and readable storage medium | |
US8688428B2 (en) | Performance evaluation device, performance evaluation method and simulation program | |
CN116611375A (en) | Software and hardware collaborative simulation platform and software and hardware testing method | |
US10162913B2 (en) | Simulation device and simulation method therefor | |
US9892010B2 (en) | Persistent command parameter table for pre-silicon device testing | |
CN115391181A (en) | Verification method of SOC chip | |
CN111950219B (en) | Method, apparatus, device and medium for realizing simulator | |
US11526490B1 (en) | Database log performance | |
CN111258748B (en) | Distributed file system and control method | |
CN114444423B (en) | Data processing method and system based on verification platform and electronic equipment | |
CN116151308A (en) | Deep neural network checkpoint optimization system and method based on nonvolatile memory |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |