CN105718244B - A pipelined Spark task scheduling and execution method for shuffle data transmission - Google Patents

A pipelined Spark task scheduling and execution method for shuffle data transmission

Info

Publication number
CN105718244B
Authority
CN
China
Prior art keywords
task
transmission
shuffling
data
stage
Prior art date
Legal status
Active
Application number
CN201610029211.7A
Other languages
Chinese (zh)
Other versions
CN105718244A (en)
Inventor
付周望
张未雨
戚正伟
管海兵
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201610029211.7A
Publication of CN105718244A
Application granted
Publication of CN105718244B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3867Concurrent instruction execution, e.g. pipeline, look ahead using instruction pipelines
    • G06F9/3869Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking

Abstract

The invention discloses a pipelined Spark task scheduling and execution method for shuffle data transmission. Stages and their tasks are submitted and executed from back to front, while predecessor tasks send their execution results directly into the memory of successor tasks. Without changing the user interface and without sacrificing the integrity or fault tolerance of stages, the method eliminates the disk read/write overhead of shuffle data transfer (Shuffle) between stages (Stage) in stock Spark, thereby reducing the running time of distributed computing programs on Spark.

Description

A pipelined Spark task scheduling and execution method for shuffle data transmission
Technical field
The present invention relates to the field of distributed computing frameworks. Specifically, it modifies the task scheduling mechanism of the distributed computing framework Spark in order to improve the performance of that framework.
Background technology
As the most widely used distributed computing framework today, Spark has been deployed in countless data centers. Its Resilient Distributed Dataset (RDD) abstraction lets big-data computation proceed in memory as much as possible. In its execution logic, Spark generates RDDs from front to back according to the logic of the user program, and each RDD records its own dependencies. When the user program needs to produce a final result, Spark recurses backward from the last RDD and divides the computation into stages (Stage) at every shuffle dependency (Shuffle Dependency) it finds. After the stages have been divided, Spark submits them from front to back, first submitting the stages with no missing dependencies, then proceeding backward in turn. This scheduling logic makes data flow automatically to where it must be computed and keeps intermediate results in memory as much as possible.
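The backward recursion described above can be sketched in a few lines. The following is a minimal, self-contained Python model (not Spark's actual DAGScheduler, which is written in Scala; the `RDD` class and its `deps` field are simplified stand-ins) that walks the lineage of the final RDD and cuts a new stage at every shuffle dependency:

```python
# Minimal model of Spark's stage division: walk the lineage of the
# final RDD from back to front, cutting a stage at every shuffle
# dependency. Classes and fields are illustrative stand-ins.

class RDD:
    def __init__(self, name, deps=()):
        self.name = name
        self.deps = deps            # list of (parent_rdd, is_shuffle)

def divide_stages(final_rdd):
    """Return stages as lists of RDD names, last stage first."""
    stages = []

    def build_stage(rdd):
        members, parents = [], []

        def visit(r):
            members.append(r.name)
            for parent, is_shuffle in r.deps:
                if is_shuffle:      # stage boundary: new stage for the parent
                    parents.append(parent)
                else:               # narrow dependency: same stage
                    visit(parent)

        visit(rdd)
        stages.append(members)
        for p in parents:
            build_stage(p)

    build_stage(final_rdd)
    return stages

# Example lineage: textFile -> map --shuffle--> reduceByKey
a = RDD("textFile")
b = RDD("map", [(a, False)])
c = RDD("reduceByKey", [(b, True)])   # shuffle dependency cuts a stage here
print(divide_stages(c))               # [['reduceByKey'], ['map', 'textFile']]
```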
However, to guarantee the separation between stages (Stage) and the fault tolerance of the framework itself, whenever a stage boundary is a shuffle dependency (Shuffle Dependency), Spark writes the intermediate results produced by the predecessor stage to disk and only then dispatches the tasks of the next stage, which read the data back remotely from disk before computing.
Given that disks remain far slower than memory, this portion of data reading and writing has become the biggest bottleneck limiting how much Spark can be improved. Yet, so as not to compromise the integrity and fault tolerance of stage division, no patch or solution optimizing this bottleneck has appeared so far.
Summary of the invention
Aiming at the disk read/write bottleneck of Spark's shuffle data transmission, the present invention proposes a pipelined Spark task scheduling method for shuffle (Shuffle) transmission. By changing the submission order of Spark tasks, so that a task is scheduled and dispatched before its predecessor tasks finish, and by having predecessor tasks send their execution results directly into the memory of successor tasks, the method eliminates the disk read/write overhead of shuffle data transfer between stages (Stage) in stock Spark, without changing the user interface and without sacrificing the integrity or fault tolerance of stages. Predecessor tasks thus send intermediate results over the network to successor tasks while they are still executing, avoiding disk I/O and improving the performance of the Spark distributed computing framework.
The technical solution of the present invention is as follows:
A pipelined Spark task scheduling and execution method for shuffle data transmission comprises the following steps:
Step 1: When Spark submits a job that has been divided into multiple stages, first find the last stage, the one that produces the result of the user's job;
Step 2: Starting from the last stage, determine whether the stage has unfinished predecessor stages:
If all predecessor stages of this stage have finished executing, submit the stage for execution;
If some predecessor stage has not executed, mark this stage as waiting, submit the stage for execution anyway, and recursively submit the stage's predecessor stages;
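Steps 1 and 2 can be sketched as follows. This is a hedged Python model of the back-to-front submission order, not Spark's DAGScheduler code; the dict-based stage records and the `finished`/`submitted`/`waiting` collections are illustrative assumptions. The key difference from stock Spark is that a stage with unfinished predecessors is still submitted, only marked as waiting:

```python
# Sketch of Steps 1-2: submit stages from back to front. A stage whose
# predecessors are unfinished is marked "waiting" but submitted anyway,
# and its predecessors are then submitted recursively.

def submit_stage(stage, finished, submitted, waiting):
    """stage: dict with 'id' and 'parents' (list of stage dicts)."""
    unfinished = [p for p in stage["parents"] if p["id"] not in finished]
    if not unfinished:
        submitted.append(stage["id"])      # all predecessors done: plain submit
    else:
        waiting.add(stage["id"])           # mark as waiting ...
        submitted.append(stage["id"])      # ... but submit it anyway
        for parent in unfinished:          # and recursively submit predecessors
            submit_stage(parent, finished, submitted, waiting)

# Two map stages feed a final reduce stage; nothing has finished yet.
s1 = {"id": 1, "parents": []}
s2 = {"id": 2, "parents": []}
s3 = {"id": 3, "parents": [s1, s2]}
submitted, waiting = [], set()
submit_stage(s3, finished=set(), submitted=submitted, waiting=waiting)
print(submitted, waiting)   # [3, 1, 2] {3}
```

The last stage is submitted first and marked waiting, so its executors can be reserved and its pipeline caches prepared before the map stages even start producing data.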
Step 3: After submitting a stage for execution, the scheduler decomposes the stage into multiple tasks and determines whether the stage is a waiting stage:
If the stage is marked as waiting, the scheduler requests from the resource manager as many idle executor nodes as there are tasks. Once the scheduler has obtained the executor nodes for the tasks, it recurses backward through the dependencies of the RDDs contained in the stage to find shuffle dependencies; for every shuffle dependency found, the scheduler registers the pipeline information of that shuffle transmission with the map output tracker. After registration is complete, the scheduler also notifies every executor node that will run one of these tasks to prepare memory for caching the intermediate results sent by its predecessor tasks. Upon receiving the scheduler's registration message, each executor node creates in its local cache a key-value entry whose key is the shuffle dependency ID and whose value is an array of caches, one per reduce data block; it also creates locally a key-value entry whose key is the shuffle dependency ID and whose value is an array of semaphore structures, one per reduce data block, where each semaphore holds the total number of shuffle map tasks that the current shuffle dependency depends on;
Otherwise, proceed directly to the next step;
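The executor-node state built in Step 3 (and decremented in Step 10) can be sketched as follows. This is an illustrative Python model, not the BlockManager code the patent refers to; class and method names are assumptions. Per shuffle dependency ID there is one cache array and one countdown counter per reduce data block, the counter starting at the number of shuffle map tasks:

```python
# Sketch of the per-node pipeline state of Step 3: for each shuffle
# dependency ID, one cache list and one countdown counter per reduce
# data block; the counter starts at the number of shuffle map tasks.

class PipelineState:
    def __init__(self):
        self.caches = {}    # shuffle_dep_id -> [records cached per block]
        self.latches = {}   # shuffle_dep_id -> [remaining map tasks per block]

    def register(self, shuffle_dep_id, num_reduce_blocks, num_map_tasks):
        self.caches[shuffle_dep_id] = [[] for _ in range(num_reduce_blocks)]
        self.latches[shuffle_dep_id] = [num_map_tasks] * num_reduce_blocks

    def receive(self, shuffle_dep_id, block, kv):      # Step 7, reduce side
        self.caches[shuffle_dep_id][block].append(kv)

    def map_task_done(self, shuffle_dep_id, block):    # Step 10: decrement
        self.latches[shuffle_dep_id][block] -= 1
        return self.latches[shuffle_dep_id][block] == 0  # True: block complete

state = PipelineState()
state.register(shuffle_dep_id=0, num_reduce_blocks=2, num_map_tasks=2)
state.receive(0, 1, ("k", 1))
print(state.map_task_done(0, 1))   # False: one map task still running
print(state.map_task_done(0, 1))   # True: block 1 now complete
```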
Step 4: While the scheduler packages the stage's task set, determine whether the stage is a shuffle map stage:
If the stage is a shuffle map stage, set the corresponding shuffle dependency ID on every task in the stage;
Otherwise, proceed directly to the next step;
Step 5: The scheduler dispatches the packaged tasks to the executor nodes;
Step 6: When a task is assigned to the executor of an executor node for execution, the executor determines whether the task is a shuffle map task:
If it is, the executor uses the shuffle dependency ID carried by the task to request from the map output tracker the set of executor nodes responsible for the reduce tasks of that ID; it then attaches the corresponding reduce information to the shuffle map task, packaging the reduce data block numbers and remote addresses from the received set into a hash table that is passed to the shuffle map task, and proceeds to Step 7;
If the executor determines that the task is a reduce task, it calls the function corresponding to the reduce task to compute, and proceeds to Step 11;
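The address-table construction in Step 6 amounts to a simple lookup and repackaging. A minimal Python sketch, with a plain dict standing in for the map output tracker (the real MapOutputTracker API is not reproduced here):

```python
# Sketch of Step 6: ask the map output tracker which node owns each
# reduce data block and package (block number -> remote address) into
# a hash table handed to the shuffle map task.

def build_reduce_address_table(tracker, shuffle_dep_id):
    """tracker: shuffle_dep_id -> list of (block_number, node_address)."""
    table = {}
    for block, address in tracker[shuffle_dep_id]:
        table[block] = address          # reduce block number -> remote node
    return table

tracker = {7: [(0, "nodeA:7337"), (1, "nodeB:7337")]}
table = build_reduce_address_table(tracker, 7)
print(table)   # {0: 'nodeA:7337', 1: 'nodeB:7337'}
```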
Step 7: When a shuffle map task starts executing, it checks whether pipelined data output is required;
If it is, the task first uses the partitioner specified by the user, or Spark's default partitioner, to compute from the key of each intermediate key-value pair the reduce data block number it belongs to; using the hash table of data block numbers and remote addresses set up earlier, it sends the computed data to the executor node responsible for the corresponding follow-up reduce task. The transmitted message contains: the shuffle dependency ID, the reduce data block number, and the key-value pair of the data result. While sending the data, the executor also writes it to disk, then proceeds to Step 8. Meanwhile, when the executor node responsible for the reduce task receives the pipelined data, it uses the shuffle dependency ID as an index and stores the data in the cache for that reduce data block number within the cache array for that ID, then proceeds to Step 8;
If not, the execution results are written directly to disk; proceed to Step 8;
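The map-side half of Step 7 can be sketched as follows — an illustrative Python model under stated assumptions (hash partitioning as the default partitioner; `send` and `disk` are stand-ins for the network push and the disk write), not Spark's shuffle writer:

```python
# Sketch of Step 7, map side: partition each intermediate key-value
# pair, push it to the node owning that reduce data block, and write
# it to disk at the same time (the disk copy preserves fault tolerance).

def run_shuffle_map_task(records, num_blocks, addresses, send, disk):
    """addresses: block number -> remote node; send(node, msg); disk: list."""
    for key, value in records:
        block = hash(key) % num_blocks          # default partitioner: hash of key
        msg = {"shuffle_dep_id": 0,             # shuffle dependency ID
               "block": block,                  # reduce data block number
               "kv": (key, value)}              # the data result itself
        send(addresses[block], msg)             # pipelined network push ...
        disk.append(msg)                        # ... while also writing to disk

sent, disk = [], []
addresses = {0: "nodeA", 1: "nodeB"}
run_shuffle_map_task([("x", 1), ("y", 2)], 2,
                     addresses, lambda node, m: sent.append((node, m)), disk)
print(len(sent) == len(disk) == 2)   # True: every record went both ways
```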
Step 8: The executor completes the shuffle map task;
Step 9: When a shuffle map task finishes running, it sends a pipeline-end message to every executor node responsible for a reduce task. The message contains: the shuffle dependency ID, the map data block number this task was responsible for, and the reduce data block number that the destination executor node is responsible for;
Step 10: When an executor node responsible for a reduce task receives a pipeline-end message, it uses the shuffle dependency ID as an index to find the semaphore array for that ID and decrements by one the countdown-latch counter of the given reduce data block number. If the counter reaches 0, all map data that the reduce data block depends on has finished transmitting;
Step 11: When the executor runs the function specified by a reduce task, it calls the corresponding reduce function; when that function reads data for the task, it requests from the executor node an iterator over the input data;
Step 12: When the iterator is created, it asks the executor node whether the current shuffle transmission has a local cache, i.e. whether it was transferred through the pipeline:
If it was, the iterator calls the executor node's get-cache method, requesting the cache by the reduce task's shuffle dependency ID and the reduce data block number the task is responsible for, and proceeds to Step 13;
Otherwise, it reads the data remotely and proceeds to Step 15;
Step 13: When the executor node receives the get-cache call, it uses the shuffle dependency ID as an index to find the corresponding cache array in the cache and returns an asynchronous reference to the cache of the given reduce data block number;
Step 14: Upon receiving the asynchronous cache reference, the iterator starts waiting until the countdown-latch counter of the reduce data block number of the shuffle dependency needed by the task reaches 0, meaning that all map data blocks the reduce data block depends on are complete; then proceed to Step 15;
Step 15: The executor runs the specified reduce function.
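The reduce-side waiting of Steps 11-15 can be sketched as follows. This is a minimal Python model, assuming a `threading.Event` as a stand-in for the CountDownLatch-style counter described above; the class and method names are illustrative, not the patent's actual data structures:

```python
# Sketch of Steps 11-15: the iterator obtains an asynchronous reference
# to the cached block and blocks until the countdown latch reaches 0,
# i.e. until every predecessor map task has sent its pipeline-end
# message; only then does the reduce function start consuming data.

import threading

class CachedBlock:
    def __init__(self, num_map_tasks):
        self.records = []
        self.remaining = num_map_tasks
        self.complete = threading.Event()

    def map_task_done(self):             # Step 10: decrement the latch
        self.remaining -= 1
        if self.remaining == 0:
            self.complete.set()

    def iterator(self):                  # Steps 13-14: wait, then iterate
        self.complete.wait()
        return iter(self.records)

block = CachedBlock(num_map_tasks=1)
block.records.append(("k", 1))

# A timer simulates the pipeline-end message arriving asynchronously.
t = threading.Timer(0.05, block.map_task_done)
t.start()
print(list(block.iterator()))   # [('k', 1)] — printed only after the latch hits 0
t.join()
```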
Compared with the prior art, the beneficial effect of the invention is that it significantly shortens the shuffle transmission time during Spark task execution and thus shortens the execution time of distributed computing tasks.
Brief description of the drawings
Fig. 1 Architecture diagram
Fig. 2 Flow chart of the scheduler's scheduling method
Fig. 3 Flow chart of an executor node
Fig. 4 Pipeline registration message
Fig. 5 Pipeline notification message
Fig. 6 Pipeline transmission message
Fig. 7 Pipeline end message
Fig. 8 Executor node cache architecture
Fig. 9 Executor node signal-control architecture
Detailed description of the embodiments
The embodiments of the invention are described in detail below with reference to the accompanying drawings. This embodiment is implemented on the premise of the technical solution of the invention and gives a detailed implementation and a concrete operating procedure, but the applicable platform is not limited to the following embodiment: any cluster compatible with master-branch Spark can run the invention. The concrete platform of this example is a small cluster of two ordinary servers, each running Ubuntu Server 14.04.1 LTS 64-bit with 8 GB of memory. The implementation of the invention is based on the source code of Apache Spark 1.5.
As shown in Fig. 1, on top of the original Spark architecture the invention involves the scheduler (DAGScheduler), the resource manager (BlockManagerMaster), the map output tracker (MapOutputTracker), the executor nodes (BlockManager) and the executors (Executor). By changing the task scheduling algorithm and the execution flow, shuffle data transfer is pipelined without breaking stage completion or fault tolerance. The scheduler and the executor nodes operate according to the flows of Fig. 2 and Fig. 3 respectively, yielding the improvement in distributed computing performance.
Spark containing the invention is deployed on every server, one server serving as the master of the Spark cluster and the remaining machines as slaves of the cluster. Note that, to guarantee the performance of the invention, the deployed cluster should be configured with more memory than an unmodified Spark cluster; the exact memory size depends on the volume of data processed.
After deployment, distributed computing applications can be run in the usual Spark way. The modification is fully transparent to users of the Spark computing framework.
The pipelined shuffle Spark task scheduling and execution method then proceeds on this platform according to Steps 1 to 15 as set out above.
While the above steps are executed, an error in any one of them triggers the fault-tolerance mechanism: if the error occurs in a predecessor task of a pipelined shuffle transmission, i.e. in a step of a shuffle map task or any step before it, then all of its successors are marked as failed and resubmitted, and the pipelined shuffle transmission continues; if the error occurs in the execution step of a successor task of a pipelined shuffle transmission, i.e. a reduce task, the predecessor tasks are unaffected and the failed successor task is resubmitted, the resubmitted successor task then reading the data it needs from the predecessor tasks' disk output.
Because this embodiment transmits data to the nodes of the waiting reduce tasks while the shuffle map tasks are still executing, the waiting time of stock Spark's shuffle transmission is hidden. When the shuffle map tasks end, the reduce tasks can start much sooner than before, accelerating the whole distributed computation.
Running Spark benchmark programs such as Word Count on top of this embodiment verifies the correctness of the invention, and across different benchmarks the invention shows performance improvements of varying degrees over master-branch Spark.
Preferred embodiments of the invention are described in detail above. It should be understood that those of ordinary skill in the art can make many modifications and variations according to the concept of the invention without creative work. Therefore, any technical solution that a person skilled in the art can obtain under the concept of the invention from the prior art through logical analysis, reasoning or limited experimentation shall fall within the scope of protection defined by the claims.

Claims (6)

  1. A pipelined Spark task scheduling and execution method for shuffle data transmission, characterised in that it comprises the following steps:
    Step 1: When Spark submits a job that has been divided into multiple stages, first find the last stage, the one that produces the result of the user's job;
    Step 2: Starting from the last stage, determine whether the stage has unfinished predecessor stages:
    If all predecessor stages of this stage have finished executing, submit the stage for execution;
    If some predecessor stage has not executed, mark this stage as waiting, submit the stage for execution anyway, and recursively submit the stage's predecessor stages;
    Step 3: After submitting a stage for execution, the scheduler decomposes the stage into multiple tasks and determines whether the stage is a waiting stage:
    If the stage is marked as waiting, the scheduler requests from the resource manager as many idle executor nodes as there are tasks; once the scheduler has obtained the executor nodes for the tasks, it recurses backward through the dependencies of the Resilient Distributed Datasets contained in the stage to find shuffle dependencies, and for every shuffle dependency found it registers the pipeline information of that shuffle transmission with the map output tracker; after registration is complete, the scheduler also notifies every executor node that will run one of these tasks to prepare memory for caching the intermediate results sent by its predecessor tasks; upon receiving the scheduler's registration message, each executor node creates in its local cache a key-value entry whose key is the shuffle dependency ID and whose value is an array of caches, one per reduce data block, and also creates locally a key-value entry whose key is the shuffle dependency ID and whose value is an array of semaphore structures, one per reduce data block, each semaphore holding the total number of shuffle map tasks that the given shuffle dependency depends on;
    Otherwise, proceed directly to the next step;
    Step 4: While the scheduler packages the stage's task set, determine whether the stage is a shuffle map stage:
    If the stage is a shuffle map stage, set the corresponding shuffle dependency ID on every task in the stage;
    Otherwise, proceed directly to the next step;
    Step 5: The scheduler dispatches the packaged tasks to the executor nodes;
    Step 6: When a task is assigned to the executor of an executor node for execution, the executor determines whether the task is a shuffle map task:
    If it is, the executor uses the shuffle dependency ID carried by the task to request from the map output tracker the set of executor nodes responsible for the reduce tasks of that ID, then attaches the corresponding reduce information to the shuffle map task, packaging the reduce data block numbers and remote addresses from the received set into a hash table that is passed to the shuffle map task, and proceeds to Step 7;
    If the executor determines that the task is a reduce task, it calls the function corresponding to the reduce task to compute, and proceeds to Step 11;
    Step 7: When a shuffle map task starts executing, it checks whether pipelined data output is required;
    If it is, the task first uses the partitioner specified by the user, or Spark's default partitioner, to compute from the key of each intermediate key-value pair the reduce data block number it belongs to, and, using the hash table of data block numbers and remote addresses, sends the computed data to the executor node responsible for the corresponding follow-up reduce task, the transmitted message containing: the shuffle dependency ID, the reduce data block number, and the key-value pair of the data result; while sending the data, the executor also writes it to disk and proceeds to Step 8; meanwhile, when the executor node responsible for the reduce task receives the pipelined data, it uses the shuffle dependency ID as an index and stores the data in the cache for that reduce data block number within the cache array for that ID, then proceeds to Step 8;
    If not, the execution results are written directly to disk; proceed to Step 8;
    Step 8: The executor completes the shuffle map task;
    Step 9: When a shuffle map task finishes running, it sends a pipeline-end message to every executor node responsible for a reduce task, the message containing: the shuffle dependency ID, the map data block number this task was responsible for, and the reduce data block number that the destination executor node is responsible for;
    Step 10: When an executor node responsible for a reduce task receives a pipeline-end message, it uses the shuffle dependency ID as an index to find the semaphore array for that ID and decrements by one the countdown-latch counter of the given reduce data block number; if the counter reaches 0, all map data that the reduce data block depends on has finished transmitting;
    Step 11: When the executor runs the function specified by a reduce task, it calls the corresponding reduce function, which, when reading data for the task, requests from the executor node an iterator over the input data;
    Step 12: When the iterator is created, it asks the executor node whether the current shuffle transmission has a local cache, i.e. whether it was transferred through the pipeline:
    If it was, the iterator calls the executor node's get-cache method, requesting the cache by the reduce task's shuffle dependency ID and the reduce data block number the task is responsible for, and proceeds to Step 13;
    Otherwise, it reads the data remotely and proceeds to Step 15;
    Step 13: When the executor node receives the get-cache call, it uses the shuffle dependency ID as an index to find the corresponding cache array in the cache and returns an asynchronous reference to the cache of the given reduce data block number;
    Step 14: Upon receiving the asynchronous cache reference, the iterator waits until the countdown-latch counter of the reduce data block number of the shuffle dependency needed by the task reaches 0, meaning that all map data blocks the reduce data block depends on are complete, then proceeds to Step 15;
    Step 15: The executor runs the specified reduce function.
  2. The Spark task scheduling and execution method with pipelined shuffle data transmission according to claim 1, characterized in that an error in any of Steps 1 to 15 triggers the fault-tolerance mechanism: if the error occurs in a step of a predecessor task of the pipelined shuffle transmission, i.e. a shuffle map task, or in any step before it, then all of its successors are marked as failed and resubmitted, and the pipelined shuffle transmission continues; if the error occurs in an execution step of a successor task of the pipelined shuffle transmission, i.e. a reduce task, the predecessor tasks are not affected, and only the failed successor task is resubmitted; the resubmitted successor task then reads the required data from the disks of the predecessor tasks.
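The asymmetry in claim 2 can be summarized as a two-way recovery decision: a map-side failure invalidates and resubmits the whole pipeline, while a reduce-side failure resubmits only the reduce task, which then falls back to reading the predecessors' on-disk shuffle output. A toy sketch of that decision; the enum and method names are illustrative, not from the patent:

```java
// Sketch of the fault-tolerance decision in claim 2. Names are hypothetical.
public class FaultPolicy {
    public enum StageKind { MAP_OR_EARLIER, REDUCE }
    public enum Action { RESUBMIT_PIPELINE, RESUBMIT_REDUCE_FROM_DISK }

    public static Action onFailure(StageKind failedStage) {
        // A predecessor (map) failure poisons every successor, so the whole
        // pipelined shuffle is resubmitted; a successor (reduce) failure leaves
        // predecessors intact and re-reads their persisted shuffle files.
        return failedStage == StageKind.MAP_OR_EARLIER
                ? Action.RESUBMIT_PIPELINE
                : Action.RESUBMIT_REDUCE_FROM_DISK;
    }
}
```

The disk fallback is what keeps the pipeline optimization safe: the in-memory cache is an acceleration, not the only copy of the shuffle data.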
  3. The Spark task scheduling and execution method with pipelined shuffle data transmission according to claim 1, characterized in that in Step 3 the scheduler requests from the resource manager a number of idle execution nodes equal to the number of tasks, adopting a strategy of acquiring idle execution nodes at random.
  4. The Spark task scheduling and execution method with pipelined shuffle data transmission according to claim 1, characterized in that the shuffle-transmission pipeline information in Step 3 includes the shuffle-transmission dependency ID and the information of the corresponding execution node set.
  5. The Spark task scheduling and execution method with pipelined shuffle data transmission according to claim 4, characterized in that the information of the execution node set includes: the reduce data block numbers each execution node is responsible for, and its meta-information, which comprises the node name and address.
  6. The Spark task scheduling and execution method with pipelined shuffle data transmission according to claim 1, characterized in that in Step 3, after registration is complete, the scheduler also notifies every execution node that will run this task to prepare memory for caching the intermediate results sent by its predecessor tasks; the intermediate-result information includes: the current shuffle-transmission dependency ID, the total number of partitions of the predecessor shuffle map tasks, the reduce data block numbers this node's subtasks are responsible for, and the total number of reduce data blocks.
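Claims 4 to 6 together describe the registration message each execution node needs before pipelined transfer can start. A minimal sketch of that structure; all class and field names here are illustrative assumptions, not identifiers from the patent or from Spark:

```java
import java.util.List;

// Sketch of the shuffle-pipeline registration information from claims 4-6.
public class PipelineInfo {
    // Meta-information of one execution node (claim 5)
    public static final class NodeMeta {
        public final String nodeName;
        public final String address;
        public NodeMeta(String nodeName, String address) {
            this.nodeName = nodeName;
            this.address = address;
        }
    }

    // One entry of the execution-node set: responsible reduce blocks + meta-info
    public static final class ExecutorEntry {
        public final List<Integer> reduceBlockIds;
        public final NodeMeta meta;
        public ExecutorEntry(List<Integer> reduceBlockIds, NodeMeta meta) {
            this.reduceBlockIds = reduceBlockIds;
            this.meta = meta;
        }
    }

    public final int shuffleDepId;              // shuffle-transmission dependency ID (claim 4)
    public final int numMapBlocks;              // total partitions of predecessor map tasks (claim 6)
    public final int numReduceBlocks;           // total number of reduce data blocks (claim 6)
    public final List<ExecutorEntry> executors; // corresponding execution-node set (claims 4-5)

    public PipelineInfo(int shuffleDepId, int numMapBlocks, int numReduceBlocks,
                        List<ExecutorEntry> executors) {
        this.shuffleDepId = shuffleDepId;
        this.numMapBlocks = numMapBlocks;
        this.numReduceBlocks = numReduceBlocks;
        this.executors = executors;
    }
}
```

Carrying the map-partition and reduce-block totals in the same message lets each node size its cache buffers and initialize its countdown counters in one step, consistent with claim 6.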
CN201610029211.7A 2016-01-18 2016-01-18 A Spark task scheduling and execution method with pipelined shuffle data transmission Active CN105718244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610029211.7A CN105718244B (en) 2016-01-18 2016-01-18 A Spark task scheduling and execution method with pipelined shuffle data transmission

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610029211.7A CN105718244B (en) 2016-01-18 2016-01-18 A Spark task scheduling and execution method with pipelined shuffle data transmission

Publications (2)

Publication Number Publication Date
CN105718244A CN105718244A (en) 2016-06-29
CN105718244B true CN105718244B (en) 2018-01-12

Family

ID=56147869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610029211.7A Active CN105718244B (en)

Country Status (1)

Country Link
CN (1) CN105718244B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106168963B (en) * 2016-06-30 2019-06-11 北京金山安全软件有限公司 Real-time streaming data processing method and device and server
CN106371919B (en) * 2016-08-24 2019-07-16 上海交通大学 It is a kind of based on mapping-reduction computation model data cache method of shuffling
CN107612886B (en) * 2017-08-15 2020-06-30 中国科学院大学 Spark platform Shuffle process compression algorithm decision method
CN107885587B (en) * 2017-11-17 2018-12-07 清华大学 A kind of executive plan generation method of big data analysis process
CN110083441B (en) * 2018-01-26 2021-06-04 中兴飞流信息科技有限公司 Distributed computing system and distributed computing method
CN110750341B (en) * 2018-07-24 2022-08-02 深圳市优必选科技有限公司 Task scheduling method, device, system, terminal equipment and storage medium
CN109951556A (en) * 2019-03-27 2019-06-28 联想(北京)有限公司 A kind of Spark task processing method and system
CN110109747B (en) * 2019-05-21 2021-05-14 北京百度网讯科技有限公司 Apache Spark-based data exchange method, system and server
CN110134714B (en) * 2019-05-22 2021-04-20 东北大学 Distributed computing framework cache index method suitable for big data iterative computation
CN111061565B (en) * 2019-12-12 2023-08-25 湖南大学 Two-section pipeline task scheduling method and system in Spark environment
CN111258785B (en) * 2020-01-20 2023-09-08 北京百度网讯科技有限公司 Data shuffling method and device
CN113364603B (en) * 2020-03-06 2023-05-02 华为技术有限公司 Fault recovery method of ring network and physical node
CN113495679B (en) * 2020-04-01 2022-10-21 北京大学 Optimization method for large data storage access and processing based on nonvolatile storage medium
CN111782367B (en) * 2020-06-30 2023-08-08 北京百度网讯科技有限公司 Distributed storage method and device, electronic equipment and computer readable medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605576A (en) * 2013-11-25 2014-02-26 华中科技大学 Multithreading-based MapReduce execution system
CN104750482A (en) * 2015-03-13 2015-07-01 合一信息技术(北京)有限公司 Method for constructing dynamic script execution engine based on MapReduce

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9235446B2 (en) * 2012-06-22 2016-01-12 Microsoft Technology Licensing, Llc Parallel computing execution plan optimization
US9424274B2 (en) * 2013-06-03 2016-08-23 Zettaset, Inc. Management of intermediate data spills during the shuffle phase of a map-reduce job

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605576A (en) * 2013-11-25 2014-02-26 华中科技大学 Multithreading-based MapReduce execution system
CN104750482A (en) * 2015-03-13 2015-07-01 合一信息技术(北京)有限公司 Method for constructing dynamic script execution engine based on MapReduce

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Mammoth: Gearing Hadoop towards memory-intensive MapReduce applications; Xuanhua Shi et al.; IEEE Transactions on Parallel and Distributed Systems; 2015-08-01; Vol. 26, No. 8; full text *
Virtual shuffling for efficient data movement in MapReduce; Weikuan Yu et al.; IEEE Transactions on Computers; 2015-02-28; Vol. 64, No. 2; full text *

Also Published As

Publication number Publication date
CN105718244A (en) 2016-06-29

Similar Documents

Publication Publication Date Title
CN105718244B (en) A Spark task scheduling and execution method with pipelined shuffle data transmission
US5960181A (en) Computer performance modeling system and method
US7464208B2 (en) Method and apparatus for shared resource management in a multiprocessing system
US20070204271A1 (en) Method and system for simulating a multi-CPU/multi-core CPU/multi-threaded CPU hardware platform
CN109144710A (en) Resource regulating method, device and computer readable storage medium
CN111930365B (en) Qt-based application program rapid development framework, development method and operation method
CA3055071C (en) Writing composite objects to a data store
CN101243396A (en) Method and apparatus for supporting universal serial bus devices in a virtualized environment
CN113656227A (en) Chip verification method and device, electronic equipment and storage medium
CN104937564B (en) The data flushing of group form
CN107729050A (en) Real-time system and task construction method based on LET programming models
DE102013209643A1 (en) Mechanism for optimized message exchange data transfer between nodelets within a tile
CN111309649A (en) Data transmission and task processing method, device and equipment
CN106250348A (en) A kind of heterogeneous polynuclear framework buffer memory management method based on GPU memory access characteristic
US11275661B1 (en) Test generation of a distributed system
CN108845829A (en) Method for executing system register access instruction
US7216252B1 (en) Method and apparatus for machine check abort handling in a multiprocessing system
EP2672388B1 (en) Multi-processor parallel simulation method, system and scheduler
CN110235105A (en) System and method for the client-side throttling after the server process in trust client component
Lázaro-Muñoz et al. A tasks reordering model to reduce transfers overhead on GPUs
US20100161305A1 (en) Performance evaluation device, performance evaluation method and simulation program
US10162913B2 (en) Simulation device and simulation method therefor
JP2023544911A (en) Method and apparatus for parallel quantum computing
CN110825461B (en) Data processing method and device
DE102020126699A1 (en) INITIALIZATION AND ADMINISTRATION OF CLASS OF SERVICE ATTRIBUTES DURING RUNNING IN ORDER TO OPTIMIZE DEEP LEARNING TRAINING IN DISTRIBUTED ENVIRONMENTS

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant