CN105718244B - Spark task scheduling and execution method for pipelined shuffle data transmission - Google Patents
Spark task scheduling and execution method for pipelined shuffle data transmission
- Publication number
- CN105718244B CN105718244B CN201610029211.7A CN201610029211A CN105718244B CN 105718244 B CN105718244 B CN 105718244B CN 201610029211 A CN201610029211 A CN 201610029211A CN 105718244 B CN105718244 B CN 105718244B
- Authority
- CN
- China
- Prior art keywords
- task
- transmission
- shuffling
- data
- stage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline, look ahead using instruction pipelines
- G06F9/3869—Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking
Abstract
The invention discloses a Spark task scheduling and execution method for pipelined shuffle data transmission. Stages and their tasks are submitted and executed from back to front, while predecessor tasks send their execution results directly into the memory of successor tasks. Without changing the user interface or sacrificing the integrity and fault tolerance of stages, the method eliminates the disk read/write overhead of shuffle data transfer between stages in original Spark, thereby reducing the running time of distributed computing programs on Spark.
Description
Technical field
The present invention relates to the field of distributed computing frameworks. Specifically, it changes the task scheduling mechanism of the distributed computing framework Spark so as to improve the performance of the framework.
Background technology
As the most widely used distributed computing framework today, Spark has been deployed in countless data centers. Its Resilient Distributed Dataset (RDD) abstraction keeps big-data computation in memory as much as possible. In its execution logic, Spark generates RDDs from front to back according to the logic of the user program, and each RDD carries its own dependencies. When the user program needs the final output, Spark recursively traverses backward from the last RDD and divides the job into stages according to the shuffle dependencies it finds. After the stages have been divided, Spark submits them from front to back, first submitting the stages with no missing dependencies, and so on backward. This scheduling logic lets data flow automatically to where it is needed for computation and keeps intermediate results in memory as much as possible.
However, to guarantee the separation between stages and the fault tolerance of the framework itself, whenever a stage boundary is a shuffle dependency, Spark writes the intermediate results produced by the predecessor stage to disk, then begins to dispatch the tasks of the next stage, which later read the data remotely from disk before computing.
Since current disk speed is much slower than memory, this data read/write has become the biggest bottleneck limiting the performance Spark can achieve. Yet, in order not to compromise the integrity and fault tolerance of stage division, no patch or solution optimizing this bottleneck has appeared so far.
Summary of the invention
Aimed at the disk read/write bottleneck during Spark shuffle data transmission, the present invention proposes a Spark task scheduling method for pipelined shuffle data transmission. By changing the submission order of Spark tasks so that a task is scheduled and dispatched before its predecessor tasks, and by having predecessor tasks send their execution results directly into the memory of successor tasks, the method eliminates the disk read/write overhead of shuffle data transfer between stages in original Spark, without changing the user interface or sacrificing the integrity and fault tolerance of stages. Predecessor tasks send their intermediate results to successor tasks over the network while the results are being produced, avoiding disk I/O and improving the performance of the Spark distributed computing framework.
The technical solution of the present invention is as follows:
A Spark task scheduling and execution method for pipelined shuffle data transmission comprises the following steps:
Step 1: When Spark submits a job that is divided into multiple stages, first find the last stage, the one that produces the result of the user's job;
Step 2: Starting from the last stage, check whether the stage has unfinished predecessor stages:
If all predecessor stages of this stage have finished executing, submit the stage for execution;
If some predecessor stage has not executed, mark this stage as waiting, submit it for execution anyway, and recursively submit its predecessor stages;
Step 3: After a stage is submitted for execution, the scheduler decomposes the stage into multiple tasks and checks whether the stage is a waiting stage:
If the stage is marked as waiting, the scheduler requests from the resource manager as many idle execution nodes as there are tasks. After the scheduler has obtained an execution node for each task, it recursively traverses the dependencies of the Resilient Distributed Datasets contained in the stage to find shuffle dependencies; for every shuffle dependency found, the scheduler registers the pipeline information of that shuffle with the map output tracker. After registration is complete, the scheduler also notifies every execution node that will run one of these tasks to prepare memory for caching the intermediate results sent by its predecessor tasks. On receiving the scheduler's registration message, each execution node creates in its local cache a key-value pair indexed by the shuffle dependency ID whose value is an array of buffers, one per reduce data block; it also creates locally a key-value pair indexed by the shuffle dependency ID whose value is an array of semaphore data structures, one per reduce data block, each semaphore holding the total number of shuffle map tasks of the current shuffle dependency;
Otherwise, proceed directly to the next step;
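The per-node cache registration of step 3 can be sketched as follows. This is an illustration only, not the patent's source code: the class and member names (`PipelineCache`, `register`, `buffers`, `latches`) are invented, but `java.util.concurrent.CountDownLatch` is the semaphore structure the description itself names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of the local cache an execution node builds in step 3:
// per shuffle dependency ID, one buffer per reduce data block, plus one
// CountDownLatch per reduce data block counting the shuffle map tasks that
// still have to finish streaming (steps 10 and 14 operate on these latches).
class PipelineCache {
    static final Map<Integer, List<List<Object>>> buffers = new ConcurrentHashMap<>();
    static final Map<Integer, CountDownLatch[]> latches = new ConcurrentHashMap<>();

    static void register(int shuffleDepId, int numReduceBlocks, int numMapTasks) {
        List<List<Object>> perBlock = new ArrayList<>();
        CountDownLatch[] perBlockLatches = new CountDownLatch[numReduceBlocks];
        for (int i = 0; i < numReduceBlocks; i++) {
            perBlock.add(new ArrayList<>());                      // empty buffer for block i
            perBlockLatches[i] = new CountDownLatch(numMapTasks); // one count per map task
        }
        buffers.put(shuffleDepId, perBlock);
        latches.put(shuffleDepId, perBlockLatches);
    }
}
```

Both maps are keyed by the shuffle dependency ID, which is why every later message (steps 7, 9, 10, 12) carries that ID.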
Step 4: When the scheduler packages the task set of a stage, it checks whether the stage is a shuffle map stage:
If the stage is a shuffle map stage, it sets the corresponding shuffle dependency ID on each task of the stage;
Otherwise, proceed directly to the next step;
Step 5: The scheduler dispatches the packaged tasks to the execution nodes;
Step 6: When a task is assigned to the executor of an execution node, the executor checks whether the task is a shuffle map task:
If it is, then using the shuffle dependency ID carried by the task, the executor asks the map output tracker for the set of execution nodes running the reduce tasks of that ID; it then sets the reduce information of the shuffle map task, packaging the reduce data block numbers and remote addresses from the received set into a hash table that is passed to the shuffle map task, and enters step 7;
If the executor determines that the task is a reduce task, it calls the function of the reduce task to compute, and enters step 11;
Step 7: When a shuffle map task starts to execute, it checks whether pipelined data output is needed:
If needed, it first uses the partitioner specified by the user, or Spark's default partitioner, to compute from the key of each intermediate key-value pair the number of the reduce data block it belongs to; then, using the hash table of data block numbers and remote addresses set in step 6, it sends the computed results to the execution nodes responsible for the subsequent reduce tasks. Each message contains: the shuffle dependency ID, the reduce data block number, and the key-value pairs of the result. While the data is being sent, the executor also writes it to disk, and enters step 8. Meanwhile, when an execution node responsible for a reduce task receives pipelined data, it stores the data, indexed by the shuffle dependency ID, in the buffer of the corresponding reduce data block of the buffer array, and enters step 8;
If not needed, the execution result is written directly to disk, and the method enters step 8;
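The key-to-block computation in step 7 can be sketched as below. The sketch mirrors the behavior of Spark's default `HashPartitioner` (a null key goes to block 0, and a negative hash code is wrapped to stay non-negative); the class and method names are invented for illustration.

```java
// Hypothetical sketch of step 7's block numbering: map an intermediate key
// to the reduce data block that will receive it, following the convention
// of Spark's default hash partitioner.
class BlockNumbering {
    static int reduceBlockFor(Object key, int numReduceBlocks) {
        if (key == null) return 0;
        int mod = key.hashCode() % numReduceBlocks;
        return mod < 0 ? mod + numReduceBlocks : mod; // keep result in [0, numReduceBlocks)
    }
}
```

Because every node applies the same deterministic function, a map task can look up the block number in its address hash table and stream the pair straight to the node that will reduce it.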
Step 8: The executor completes the shuffle map task;
Step 9: When a shuffle map task finishes running, it sends a pipeline-end message to the execution nodes of all responsible reduce tasks. The message contains: the shuffle dependency ID, the map data block number handled by the task, and the reduce data block number handled by the execution node receiving the message;
Step 10: When an execution node responsible for a reduce task receives a pipeline-end message, it uses the shuffle dependency ID as an index to find the corresponding semaphore array and decrements the CountDownLatch of the given reduce data block by one. When the semaphore reaches 0, all the map data that the reduce data block depends on has finished transmitting;
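The bookkeeping of steps 9-10 amounts to one `countDown()` per pipeline-end message, as the following sketch shows; the class and method names are invented, and the latch array stands for the per-dependency semaphore array the node created in step 3.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of step 10: on a pipeline-end message, the reduce-side
// node counts down the latch of the named reduce data block; when the count
// reaches zero, every map task feeding that block has finished streaming.
class PipelineEnd {
    static boolean onPipelineEnd(CountDownLatch[] latchesForDep, int reduceBlockNo) {
        latchesForDep[reduceBlockNo].countDown();            // one more map task done
        return latchesForDep[reduceBlockNo].getCount() == 0; // true: block fully streamed
    }
}
```

Using a latch rather than a plain counter matters for step 14: a reduce-side iterator can simply block on `await()` instead of polling.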
Step 11: When the executor executes the specified function of a reduce task, it calls the corresponding reduce function; when this function reads data for the task, it requests an iterator over the input data from the execution node;
Step 12: When the iterator is created, it asks the execution node whether the current shuffle has a local cache, i.e. whether the data was transferred in pipelined fashion:
If so, it calls the execution node's get-cache method, requesting the cache with the shuffle dependency ID of the reduce task and the reduce data block number handled by the task, and enters step 13;
Otherwise, it reads the data remotely, and enters step 15;
Step 13: When the execution node receives the get-cache call, it uses the shuffle dependency ID as an index to find the corresponding buffer array in the cache and returns an asynchronous reference to the buffer of the reduce data block;
Step 14: When the iterator receives the asynchronous reference to the buffer, it waits until the CountDownLatch semaphore of the required reduce data block of the shuffle dependency reaches 0, meaning all the map data blocks that the reduce data block depends on are complete, and then enters step 15;
Step 15: The executor executes the specified reduce function.
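Steps 12-14 can be sketched as a single blocking handoff: the iterator holds the asynchronous reference (buffer plus latch) and waits on the latch before exposing the buffered key-value pairs. Again, this is an invented illustration under the names above, not the patent's source.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of steps 12-14: the reduce task's iterator blocks on
// the reduce data block's latch, so the reduce function only ever sees the
// buffer after every dependent shuffle map task has streamed its data.
class BufferedBlockIterator {
    static Iterator<Object> awaitAndIterate(List<Object> buffer, CountDownLatch latch) {
        try {
            latch.await(); // step 14: wait until the block's count reaches zero
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while waiting for pipelined data", e);
        }
        return buffer.iterator();
    }
}
```

The latch's memory-visibility guarantee (actions before `countDown()` happen-before the return of `await()`) is what makes reading the shared buffer without further locking safe here.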
Compared with the prior art, the beneficial effect of the invention is that the shuffle transmission time during Spark task execution is significantly shortened, reducing the execution time of distributed computing jobs.
Brief description of the drawings
Fig. 1 is the architecture diagram
Fig. 2 is the flow chart of the scheduler's scheduling method
Fig. 3 is the flow chart of an execution node
Fig. 4 shows the pipeline registration message
Fig. 5 shows the pipeline notification message
Fig. 6 shows the pipeline transmission message
Fig. 7 shows the pipeline end message
Fig. 8 shows the cache architecture of an execution node
Fig. 9 shows the signal control architecture of an execution node
Detailed description of the embodiments
Embodiments of the invention are elaborated below with reference to the accompanying drawings. The present embodiment is implemented on the premise of the technical solution and algorithm of the invention and gives a detailed implementation and concrete operation process, but the applicable platform is not limited to the following embodiment: any cluster compatible with the Spark master branch can run the invention. The concrete operation platform of this example is a small cluster composed of two ordinary servers, each running Ubuntu Server 14.04.1 LTS 64-bit and equipped with 8 GB of memory. The invention was developed on the source code of Apache Spark 1.5.
As shown in Fig. 1, the invention builds on the original Spark architecture, which includes the scheduler (DAGScheduler), the resource manager (BlockManagerMaster), the map output tracker (MapOutputTracker), the execution node (BlockManager), and the executor (Executor). By changing the task scheduling algorithm and execution flow, it pipelines shuffle data transfer without compromising the completeness and fault tolerance of stages. The scheduler and the execution nodes operate according to the flows of Fig. 2 and Fig. 3, respectively, realizing the improvement in distributed computing performance.
The Spark distribution containing the invention is deployed on both servers, one serving as the master of the Spark cluster and the other as a slave. Note that, to guarantee the performance of the invention, the deployed cluster should be configured with more memory than an original Spark cluster; the exact memory size depends on the data volume of the job. After deployment, distributed computing applications can be run in the usual Spark manner; the change is fully transparent to users of the Spark framework.
When the above steps are executed, an error in any one of them triggers the fault-tolerance mechanism: if the error occurs in a predecessor task of a pipelined shuffle, i.e. in a step of a shuffle map task or in any step before it, all of its successors are marked as failed and resubmitted, and pipelined execution continues; if the error occurs in the execution step of a successor task of a pipelined shuffle, i.e. a reduce task, the predecessor tasks are unaffected and the failed successor task is resubmitted; the resubmitted successor task then reads the required data from the disks of the predecessor tasks.
Because this embodiment sends data to the nodes of the waiting reduce tasks while the shuffle map tasks are still executing, the shuffle wait time of original Spark is hidden. When the shuffle map tasks finish, the reduce tasks can start within a much shorter time than before, accelerating the whole distributed computation.
Running Spark benchmark programs such as Word Count on this embodiment verified the correctness of the invention, and the invention showed performance improvements of varying degrees over master-branch Spark across the different benchmarks.
The preferred embodiments of the invention are described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the invention without creative work. Therefore, any technical solution that a person skilled in the art can obtain, on the basis of the prior art, through logical analysis, reasoning, or limited experimentation under the concept of the invention shall fall within the protection scope defined by the claims.
Claims (6)
- 1. A Spark task scheduling and execution method for pipelined shuffle data transmission, characterized by comprising the following steps: Step 1: when Spark submits a job that is divided into multiple stages, first find the last stage, the one that produces the result of the user's job; Step 2: starting from the last stage, check whether the stage has unfinished predecessor stages: if all predecessor stages of the stage have finished executing, submit the stage for execution; if some predecessor stage has not executed, mark the stage as waiting, submit it for execution, and recursively submit its predecessor stages; Step 3: after a stage is submitted for execution, the scheduler decomposes the stage into multiple tasks and checks whether the stage is a waiting stage: if the stage is marked as waiting, the scheduler requests from the resource manager as many idle execution nodes as there are tasks; after the scheduler has obtained an execution node for each task, it recursively traverses the dependencies of the Resilient Distributed Datasets contained in the stage to find shuffle dependencies, and for every shuffle dependency found registers the dependency information of that shuffle with the map output tracker; after registration is complete, the scheduler also notifies every execution node that will run one of these tasks to prepare memory for caching the intermediate results sent by its predecessor tasks; on receiving the scheduler's registration message, each execution node creates in its local cache a key-value pair indexed by the shuffle dependency ID whose value is an array of buffers, one per reduce data block, and also creates locally a key-value pair indexed by the shuffle dependency ID whose value is an array of semaphore data structures, one per reduce data block, each semaphore holding the total number of shuffle map tasks of the given shuffle dependency; otherwise, proceed directly to the next step; Step 4: when the scheduler packages the task set of a stage, it checks whether the stage is a shuffle map stage: if so, it sets the corresponding shuffle dependency ID on each task of the stage; otherwise, proceed directly to the next step; Step 5: the scheduler dispatches the packaged tasks to the execution nodes; Step 6: when a task is assigned to the executor of an execution node, the executor checks whether the task is a shuffle map task: if so, using the shuffle dependency ID carried by the task, it asks the map output tracker for the set of execution nodes running the reduce tasks of that ID, then sets the reduce information of the shuffle map task, packaging the reduce data block numbers and remote addresses from the received set into a hash table that is passed to the shuffle map task, and enters step 7; if the executor determines that the task is a reduce task, it calls the function of the reduce task to compute, and enters step 11; Step 7: when a shuffle map task starts to execute, it checks whether pipelined data output is needed: if needed, it first uses the partitioner specified by the user, or Spark's default partitioner, to compute from the key of each intermediate key-value pair the number of the reduce data block it belongs to, then, using the hash table of data block numbers and remote addresses, sends the computed results to the execution nodes responsible for the subsequent reduce tasks, each message containing the shuffle dependency ID, the reduce data block number, and the key-value pairs of the result; while the data is being sent, the executor also writes it to disk and enters step 8; meanwhile, when an execution node responsible for a reduce task receives pipelined data, it stores the data, indexed by the shuffle dependency ID, in the buffer of the corresponding reduce data block of the buffer array, and enters step 8; if not needed, the execution result is written directly to disk and the method enters step 8; Step 8: the executor completes the shuffle map task; Step 9: when a shuffle map task finishes running, it sends a pipeline-end message to the execution nodes of all responsible reduce tasks, the message containing the shuffle dependency ID, the map data block number handled by the task, and the reduce data block number handled by the execution node receiving the message; Step 10: when an execution node responsible for a reduce task receives a pipeline-end message, it uses the shuffle dependency ID as an index to find the corresponding semaphore array and decrements the CountDownLatch counter of the given reduce data block by one; when the counter reaches 0, all the map data that the reduce data block depends on has finished transmitting; Step 11: when the executor executes the specified function of a reduce task, it calls the corresponding reduce function, which, when reading data for the task, requests an iterator over the input data from the execution node; Step 12: when the iterator is created, it asks the execution node whether the current shuffle has a local cache, i.e. whether the data was transferred in pipelined fashion: if so, it calls the execution node's get-cache method, requesting the cache with the shuffle dependency ID of the reduce task and the reduce data block number handled by the task, and enters step 13; otherwise, it reads the data remotely and enters step 15; Step 13: when the execution node receives the get-cache call, it uses the shuffle dependency ID as an index to find the corresponding buffer array in the cache and returns an asynchronous reference to the buffer of the reduce data block; Step 14: when the iterator receives the asynchronous reference to the buffer, it waits until the CountDownLatch semaphore of the required reduce data block of the shuffle dependency reaches 0, meaning all the map data blocks that the reduce data block depends on are complete, and then enters step 15; Step 15: the executor executes the specified reduce function.
- 2. The Spark task scheduling and execution method for pipelined shuffle data transmission according to claim 1, characterized in that an error in any of steps 1 to 15 triggers a fault-tolerance mechanism: if the error occurs in a predecessor task of a pipelined shuffle, i.e. in a step of a shuffle map task or in any step before it, all of its successors are marked as failed and resubmitted, and the pipelined shuffle transmission continues; if the error occurs in the execution step of a successor task of a pipelined shuffle, i.e. a reduce task, the predecessor tasks are unaffected, the failed successor task is resubmitted, and the resubmitted successor task reads the required data from the disks of the predecessor tasks.
- 3. The Spark task scheduling and execution method for pipelined shuffle data transmission according to claim 1, characterized in that, in step 3, when the scheduler requests from the resource manager as many idle execution nodes as there are tasks, it adopts the strategy of selecting idle execution nodes at random.
- 4. The Spark task scheduling and execution method for pipelined shuffle data transmission according to claim 1, characterized in that, in step 3, the pipeline information of a shuffle transmission includes the shuffle dependency ID and the information of the corresponding set of execution nodes.
- 5. The Spark task scheduling and execution method with pipelined shuffle data transmission according to claim 4, characterized in that the information of the executor node set contains, for each executor node, the reduce data block numbers the node is responsible for and its meta-information, the meta-information including the node name and address.
- 6. The Spark task scheduling and execution method with pipelined shuffle data transmission according to claim 1, characterized in that in step 3, after registration is complete, the scheduler also notifies every executor node that will run the task to prepare memory for caching the intermediate results sent by its predecessor tasks, the intermediate results including: the current shuffle dependency ID, the total number of partitions of the predecessor shuffle map tasks, the reduce data block numbers the executor node's subtasks are responsible for, and the total number of reduce data blocks.
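The bookkeeping described in claims 4 to 6 can be pictured as two small data classes; all field and class names below are hypothetical illustrations, not identifiers from the patent:

```java
import java.util.List;

// Sketch of claims 4-6: a shuffle dependency ID plus, per executor node,
// the reduce data block numbers it handles and its name/address metadata.
public class ShufflePipelineInfo {
    public final int shuffleDependencyId;       // identifies this shuffle transmission
    public final List<ExecutorNodeInfo> nodes;  // the corresponding executor node set

    public ShufflePipelineInfo(int shuffleDependencyId, List<ExecutorNodeInfo> nodes) {
        this.shuffleDependencyId = shuffleDependencyId;
        this.nodes = nodes;
    }

    public static class ExecutorNodeInfo {
        public final List<Integer> reduceBlockNumbers; // blocks this node is responsible for
        public final String nodeName;                  // meta-information: node name
        public final String address;                   // meta-information: node address

        public ExecutorNodeInfo(List<Integer> reduceBlockNumbers, String nodeName, String address) {
            this.reduceBlockNumbers = reduceBlockNumbers;
            this.nodeName = nodeName;
            this.address = address;
        }
    }
}
```

With this layout, an executor receiving the notification of claim 6 knows, before any predecessor finishes, how much cache memory to reserve and which reduce data blocks to expect.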
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610029211.7A CN105718244B (en) | 2016-01-18 | 2016-01-18 | A kind of streamlined data are shuffled Spark task schedulings and the execution method of transmission |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105718244A CN105718244A (en) | 2016-06-29 |
CN105718244B true CN105718244B (en) | 2018-01-12 |
Family
ID=56147869
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610029211.7A Active CN105718244B (en) | 2016-01-18 | 2016-01-18 | A kind of streamlined data are shuffled Spark task schedulings and the execution method of transmission |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105718244B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106168963B (en) * | 2016-06-30 | 2019-06-11 | 北京金山安全软件有限公司 | Real-time streaming data processing method and device and server |
CN106371919B (en) * | 2016-08-24 | 2019-07-16 | 上海交通大学 | It is a kind of based on mapping-reduction computation model data cache method of shuffling |
CN107612886B (en) * | 2017-08-15 | 2020-06-30 | 中国科学院大学 | Spark platform Shuffle process compression algorithm decision method |
CN107885587B (en) * | 2017-11-17 | 2018-12-07 | 清华大学 | A kind of executive plan generation method of big data analysis process |
CN110083441B (en) * | 2018-01-26 | 2021-06-04 | 中兴飞流信息科技有限公司 | Distributed computing system and distributed computing method |
CN110750341B (en) * | 2018-07-24 | 2022-08-02 | 深圳市优必选科技有限公司 | Task scheduling method, device, system, terminal equipment and storage medium |
CN109951556A (en) * | 2019-03-27 | 2019-06-28 | 联想(北京)有限公司 | A kind of Spark task processing method and system |
CN110109747B (en) * | 2019-05-21 | 2021-05-14 | 北京百度网讯科技有限公司 | Apache Spark-based data exchange method, system and server |
CN110134714B (en) * | 2019-05-22 | 2021-04-20 | 东北大学 | Distributed computing framework cache index method suitable for big data iterative computation |
CN111061565B (en) * | 2019-12-12 | 2023-08-25 | 湖南大学 | Two-section pipeline task scheduling method and system in Spark environment |
CN111258785B (en) * | 2020-01-20 | 2023-09-08 | 北京百度网讯科技有限公司 | Data shuffling method and device |
CN113364603B (en) * | 2020-03-06 | 2023-05-02 | 华为技术有限公司 | Fault recovery method of ring network and physical node |
CN113495679B (en) * | 2020-04-01 | 2022-10-21 | 北京大学 | Optimization method for large data storage access and processing based on nonvolatile storage medium |
CN111782367B (en) * | 2020-06-30 | 2023-08-08 | 北京百度网讯科技有限公司 | Distributed storage method and device, electronic equipment and computer readable medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103605576A (en) * | 2013-11-25 | 2014-02-26 | 华中科技大学 | Multithreading-based MapReduce execution system |
CN104750482A (en) * | 2015-03-13 | 2015-07-01 | 合一信息技术(北京)有限公司 | Method for constructing dynamic script execution engine based on MapReduce |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9235446B2 (en) * | 2012-06-22 | 2016-01-12 | Microsoft Technology Licensing, Llc | Parallel computing execution plan optimization |
US9424274B2 (en) * | 2013-06-03 | 2016-08-23 | Zettaset, Inc. | Management of intermediate data spills during the shuffle phase of a map-reduce job |
- 2016-01-18: CN application CN201610029211.7A filed; granted as CN105718244B; status Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103605576A (en) * | 2013-11-25 | 2014-02-26 | 华中科技大学 | Multithreading-based MapReduce execution system |
CN104750482A (en) * | 2015-03-13 | 2015-07-01 | 合一信息技术(北京)有限公司 | Method for constructing dynamic script execution engine based on MapReduce |
Non-Patent Citations (2)
Title |
---|
Mammoth: Gearing Hadoop towards memory-intensive MapReduce applications; Xuanhua Shi et al.; IEEE Transactions on Parallel and Distributed Systems; 2015-08-01; vol. 26, no. 8; full text *
Virtual shuffling for efficient data movement in MapReduce; Weikuan Yu et al.; IEEE Transactions on Computers; 2015-02-28; vol. 64, no. 2; full text *
Also Published As
Publication number | Publication date |
---|---|
CN105718244A (en) | 2016-06-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105718244B (en) | A kind of streamlined data are shuffled Spark task schedulings and the execution method of transmission | |
US5960181A (en) | Computer performance modeling system and method | |
US7464208B2 (en) | Method and apparatus for shared resource management in a multiprocessing system | |
US20070204271A1 (en) | Method and system for simulating a multi-CPU/multi-core CPU/multi-threaded CPU hardware platform | |
CN109144710A (en) | Resource regulating method, device and computer readable storage medium | |
CN111930365B (en) | Qt-based application program rapid development framework, development method and operation method | |
CA3055071C (en) | Writing composite objects to a data store | |
CN101243396A (en) | Method and apparatus for supporting universal serial bus devices in a virtualized environment | |
CN113656227A (en) | Chip verification method and device, electronic equipment and storage medium | |
CN104937564B (en) | The data flushing of group form | |
CN107729050A (en) | Real-time system and task construction method based on LET programming models | |
DE102013209643A1 (en) | Mechanism for optimized message exchange data transfer between nodelets within a tile | |
CN111309649A (en) | Data transmission and task processing method, device and equipment | |
CN106250348A (en) | A kind of heterogeneous polynuclear framework buffer memory management method based on GPU memory access characteristic | |
US11275661B1 (en) | Test generation of a distributed system | |
CN108845829A (en) | Method for executing system register access instruction | |
US7216252B1 (en) | Method and apparatus for machine check abort handling in a multiprocessing system | |
EP2672388B1 (en) | Multi-processor parallel simulation method, system and scheduler | |
CN110235105A (en) | System and method for the client-side throttling after the server process in trust client component | |
Lázaro-Muñoz et al. | A tasks reordering model to reduce transfers overhead on GPUs | |
US20100161305A1 (en) | Performance evaluation device, performance evaluation method and simulation program | |
US10162913B2 (en) | Simulation device and simulation method therefor | |
JP2023544911A (en) | Method and apparatus for parallel quantum computing | |
CN110825461B (en) | Data processing method and device | |
DE102020126699A1 (en) | INITIALIZATION AND ADMINISTRATION OF CLASS OF SERVICE ATTRIBUTES DURING RUNNING IN ORDER TO OPTIMIZE DEEP LEARNING TRAINING IN DISTRIBUTED ENVIRONMENTS |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||