CN110262896A - Data processing acceleration method for a Spark system - Google Patents

Data processing acceleration method for a Spark system

Info

Publication number
CN110262896A
CN110262896A (application CN201910467553.0A)
Authority
CN
China
Prior art keywords
task
data
distribution
server
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910467553.0A
Other languages
Chinese (zh)
Inventor
赵来平 (Zhao Laiping)
李一鸣 (Li Yiming)
李克秋 (Li Keqiu)
苏丽叶 (Su Liye)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201910467553.0A priority Critical patent/CN110262896A/en
Publication of CN110262896A publication Critical patent/CN110262896A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5033Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering data affinity
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5015Service provider selection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/504Resource capping

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a data processing acceleration method for a Spark system, composed of three parts: a performance prediction module, a task scheduling module, and a data distribution module. The performance prediction module models the performance of a task according to given parameters and predicts its completion time. The task scheduling module dispatches computing tasks to servers: it obtains the currently available computing resources in real time through a resource monitor, and then assigns each task to run on a designated server through a developed program interface. The present invention aims to accelerate data processing in a Spark system by comprehensively considering factors such as hardware heterogeneity, computational interference, data locality, data skew, and data spilling, and by optimizing data distribution and task scheduling so as to minimize the overall completion time.

Description

Data processing acceleration method for a Spark system
Technical field
The present invention relates to the big data field of task scheduling and data distribution in distributed computing, and in particular to a data processing acceleration method for a Spark system.
Background technique
With the introduction of the MapReduce computing model, the processing and analysis of big data has become simple and efficient, but the straggler problem has always been a thorny issue in distributed computing. Stragglers are those tasks among parallel-running tasks that take an abnormally long time to complete and thereby significantly degrade overall performance. The causes of the straggler problem come mainly from two levels, the hardware computing-resource level and the application level, for example hardware heterogeneity, computational interference, data locality, data skew, and data spilling. Stragglers not only delay the overall completion time but also make tasks inefficient, waste hardware resources, and disturb the normal execution of other tasks. Moreover, the long running time of stragglers increases the risk of task errors and may even cause the entire job to fail.
At present, many methods have been proposed to alleviate the straggler problem and accelerate data processing, such as LATE, Dolly, delay scheduling, DREAMS, and LIBRA, but all of these methods have shortcomings to varying degrees and do not comprehensively consider the various influencing factors. LATE optimizes the default speculative execution mechanism so that it can adapt to heterogeneous computing environments, and Dolly fully clones small jobs to avoid the waiting and guessing of speculative execution; however, both LATE and Dolly, which alleviate stragglers through backup tasks, require a short waiting period to collect statistics on task performance before producing a strategy, and very slow tasks that keep running after the backups are launched may still waste computing resources. Delay scheduling improves overall data locality by letting tasks that cannot satisfy data locality wait, and DREAMS dynamically allocates different amounts of computing resources to tasks based on their computation scale; although both methods consider application-level data effects, they still cannot solve the straggler problem caused by data skew. LIBRA specially optimizes for data skew by supporting the splitting of key-value data, but it does not take into account the influence of computational interference and task scheduling.
Most existing methods are not applicable to Spark, currently the most popular distributed computing framework. To make up for their deficiencies, the present invention proposes a data processing acceleration method for a Spark system. Compared with existing work, the present invention comprehensively considers both the hardware computing-resource level and the application level: based on the degree to which factors such as hardware heterogeneity, computational interference, data locality, and data skew affect task performance, it establishes a task performance prediction model, proposes a model-solving algorithm that completes within seconds, and on that basis formulates data distribution and task scheduling strategies that minimize the overall running time and accelerate tasks.
Summary of the invention
The present invention aims to accelerate data processing in a Spark system by comprehensively considering factors such as hardware heterogeneity, computational interference, data locality, data skew, and data spilling, and by optimizing data distribution and task scheduling so as to minimize the overall completion time.
To solve the problems of the prior art, the present invention adopts the following technical scheme:
A data processing acceleration method for a Spark system is composed of three parts: a performance prediction module, a task scheduling module, and a data distribution module, wherein:
The performance prediction module models the performance of a task according to given parameters and predicts its completion time;
The task scheduling module dispatches computing tasks to servers, i.e., it specifies the scheduling strategy of each task according to hardware heterogeneity and data locality factors, obtains the currently available computing resources in real time through a resource monitor, and then assigns each task to run on a designated server through a developed program interface;
The data distribution module distributes the intermediate data in (key, value) format generated by Map tasks to different Reduce tasks for processing according to certain rules; wherein:
1) The data distribution module samples the data based on a binomial distribution and records the estimated distribution of each key in each Map task; sampling and analyzing the data allows the overall distribution to be estimated at low cost, from which a data distribution strategy is formulated;
2) The data distribution module performs an initial distribution of the data in the form of a recorded partition list;
2-1, the sampled data distribution is sorted by key, the keys are then evenly assigned to the tasks, and the starting key and ending key of each task's pending data are recorded to form a partition list BK;
2-2, any key not belonging to BK can find the partition it will be assigned to by comparison against the boundaries, so only the detailed distribution strategy of the keys in BK needs to be recorded;
2-3, after BK is determined, the task scheduling module is invoked to obtain a scheduling result in preparation for the next step of data distribution;
3) The data distribution module then redistributes the data;
3-1, following the data locality principle, for the to-be-allocated portion of each key in BK, a target Reduce task running on the same server as the Map task is found first and allocated to;
3-2, the remaining portion is then allocated according to the first-fit algorithm; at each allocation the data distribution module splits the data according to a configured time upper bound t_max and the performance prediction module, ensuring that no Reduce task takes longer than t_max;
3-3, if all Reduce tasks have been allocated and data still remain, t_max is increased and the distribution restarts, iterating in this way until all keys in BK have been allocated.
The task scheduling module proceeds by first scheduling Map tasks and then scheduling Reduce tasks, comprising the following steps:
1) determining an upper bound on the number of tasks that each server can be assigned; the upper bound is jointly determined by the server's own performance indicator h and the total number of tasks, with the task quantities of the servers in proportion to their h values;
2) determining the target server of each task; tasks are first scheduled according to the data locality principle, preferentially to the server where their pending data resides; if the target server's pre-assigned tasks have already reached the upper bound, the task waits for the next scheduling round;
the second scheduling round uses the first-fit algorithm and assigns the not-yet-scheduled tasks to servers that have not reached their upper bounds;
3) allocating computing resource units; after the target server of a task is determined, specific computing resource units need to be allocated to the task.
Beneficial effects
Compared with the prior art, the present invention comprehensively considers the hardware-resource-level and application-level factors that affect distributed computing tasks, makes up for the deficiencies of related techniques, and can be used to accelerate the data processing of Spark systems under various scenarios.
Detailed description of the invention
Fig. 1 is the system architecture diagram of the invention.
Fig. 2 is a schematic diagram of the initial data distribution.
Fig. 3 is a schematic diagram of task scheduling.
Fig. 4 is a schematic diagram of the data redistribution.
Specific embodiment
To make the method, purposes, and effects of the present invention easier to understand, the present invention is further described below with reference to specific embodiments and the accompanying drawings.
As shown in Fig. 1, the present invention provides a data processing acceleration method for a Spark system, composed of three parts: a performance prediction module, a task scheduling module, and a data distribution module.
1. performance prediction module
The performance prediction module models the performance of a task according to given parameters and predicts its completion time. The performance prediction module is called by the other modules and feeds back to the scheduling scheme, which is continuously adjusted according to the principle of minimizing the overall completion time. The predicted time mainly comprises the following parts:
1) Data transfer time, i.e., the time required for a task to read the data to be processed. Because the time spent reading local data (data stored on the same server on which the task runs) is essentially negligible, the data transfer time is essentially the time t_s the task spends reading the portion of its data stored on other servers: t_s = (s_s * s_e) / b_s, where s_s is the number of data records that must be read remotely, s_e is the size of each record, and b_s is the data transfer speed. This part of the time consumption reflects the influence of data locality on performance.
1-1) The number of remotely read records s_s is determined by the data distribution strategy; this module feeds back to the data distribution module to help adjust the amount of data each task will read remotely.
1-2) The size of each record s_e is determined by the stored data type and the amount of data each record contains.
1-3) The data transfer speed b_s differs with the degree of data locality. Reading local data incurs essentially no time cost; at the next level, when the task and its data are on different servers within the same rack, the transfer speed depends on the switch; performance is worst when they are in different racks.
2) Task computation time, i.e., CPU running time. The time t_c required for a task to process its data is determined by the data size and the processing speed, and the processing speed is mainly affected by factors such as the server's own performance and computational interference, so t_c = ξ / (h * f), where ξ is the amount of data the task must process, h is the server's data processing speed in the interference-free case, and f is the interference coefficient under the given conditions. This part of the time consumption combines the factors of hardware heterogeneity and computational interference.
2-1) The pending data amount ξ is determined by the data distribution strategy; this module feeds back to the data distribution module to help adjust the amount of data each task will process.
2-2) The interference-free data processing speed h of a server reflects its performance; servers with different hardware resources differ in computing capability, which is obtained through benchmark testing. This parameter can be used to address the straggler problem in heterogeneous environments.
2-3) Computational interference refers to the degradation of application performance caused by resource contention when multiple tasks run on one machine. Whether they are tasks of different processes or tasks running in different virtual machines, they compete with one another for hardware resources, and at present complete isolation is impossible, so interference cannot be avoided; building an interference model allows the performance under parallel multi-task execution to be better assessed. Based on actual experimental results, this module builds the interference model f = c0 + c1 * S/D, where c0 and c1 are constant coefficients in the model, and S and D are, respectively, the number of CPUs the server can supply and the number the tasks demand. A smaller f means more severe interference.
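As a minimal sketch (not part of the patent text), the two cost terms above can be combined into a single completion-time estimate; the function names and units here are illustrative assumptions:

```python
def interference(c0, c1, supply_cpus, demand_cpus):
    """Interference coefficient f = c0 + c1 * S/D; smaller f = worse interference."""
    return c0 + c1 * supply_cpus / demand_cpus

def predict_task_time(remote_records, record_size, bandwidth,
                      data_volume, base_speed, f):
    """Predicted completion time = remote transfer time + computation time.

    t_s = (s_s * s_e) / b_s   (local reads are treated as free)
    t_c = xi / (h * f)
    """
    t_transfer = (remote_records * record_size) / bandwidth
    t_compute = data_volume / (base_speed * f)
    return t_transfer + t_compute
```

Note that because f appears in the denominator, a smaller f (heavier interference) directly lengthens the predicted computation time, matching the model's convention.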
2. task scheduling modules
The default task scheduling mechanism of the MapReduce framework, which dispatches computing tasks to servers, prioritizes satisfying data locality, but in this way many tasks are easily concentrated on certain servers, aggravating resource contention and degrading performance. At the same time, this mechanism does not consider hardware heterogeneity, which exacerbates the straggler problem in complex computing cluster environments. This module comprehensively considers hardware heterogeneity and data locality to specify the scheduling strategy of each task: it obtains the currently available computing resources in real time through a resource monitor and then assigns each task to run on a designated server through a developed program interface, achieving more flexible scheduling and cooperating with the other modules to alleviate the straggler problem. The specific steps are as follows:
1) Determine an upper bound on the number of tasks that each server can be assigned. The upper bound is jointly determined by the server's own performance indicator h and the total number of tasks, with the task quantities of the servers in proportion to their h values; that is, the better the performance, the more tasks can be assigned, which reduces the straggler problem caused by overloading low-performance servers.
2) Determine the target server of each task. Tasks are first scheduled according to the data locality principle, preferentially to the server where their pending data resides; if the target server's pre-assigned tasks have already reached the upper bound, the task waits for the next scheduling round. The second round uses the first-fit algorithm and assigns the not-yet-scheduled tasks to servers that have not reached their upper bounds.
3) Allocate computing resource units. After the target server of a task is determined, specific computing resource units need to be allocated to the task, for example a container in the resource scheduling framework YARN, which encapsulates a certain quantity of resources such as CPU, memory, and disk for use by the running task.
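The two scheduling rounds above can be sketched as follows; this is an illustrative simplification (task and server names are assumptions), with per-server caps rounded up so that the second round always finds room:

```python
from math import ceil

def schedule(tasks, servers, perf):
    """Locality-first, capacity-capped scheduling with first-fit fallback.

    tasks   : dict task_id -> preferred server (where its data resides)
    servers : list of server names
    perf    : dict server -> performance indicator h
    Caps are proportional to h (step 1); round one honors data
    locality (step 2); round two is first-fit over servers with room.
    """
    total_h = sum(perf[s] for s in servers)
    n = len(tasks)
    cap = {s: ceil(n * perf[s] / total_h) for s in servers}
    assign, load, deferred = {}, {s: 0 for s in servers}, []
    for t, pref in tasks.items():          # round 1: data locality
        if load[pref] < cap[pref]:
            assign[t] = pref
            load[pref] += 1
        else:
            deferred.append(t)             # cap reached: wait for round 2
    for t in deferred:                     # round 2: first fit
        for s in servers:
            if load[s] < cap[s]:
                assign[t] = s
                load[s] += 1
                break
    return assign
```

Rounding each cap up guarantees the caps sum to at least the task count, so every deferred task is placed.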
3. data distribution module
The shuffle stage of the MapReduce framework is the process of distributing the intermediate data in (key, value) format generated by Map tasks to different Reduce tasks for processing according to certain rules. By default, each record's target Reduce task is determined by the hash value of its key modulo the total number of tasks, but when this mechanism handles data in which key frequencies differ greatly, excessive data may be distributed to the same task, i.e., the data skew problem, which is one of the important sources of task performance degradation. Based on the performance prediction module and combined with the task scheduling result, this module comprehensively considers each factor affecting performance, continuously adjusts the data distribution strategy, and finally minimizes the total running time of the Reduce tasks, accelerating data processing. The specific steps are as follows:
1) Data sampling. Sampling and analyzing the data allows the overall distribution to be estimated at low cost, from which a data distribution strategy is formulated. This module uses a sampling method based on the binomial distribution: each draw is an independent Bernoulli trial, i.e., the probability that each record is drawn is independent and constant. To support finer-grained data distribution and the needs of the data transfer factors, this module records the estimated distribution of each key in each Map task.
2) Initial data distribution. To avoid the heavy overhead of recording a detailed distribution strategy for every key, this module uses the form of a recorded partition list: the sampled data distribution is first sorted by key, the keys are then evenly assigned to the tasks, and finally the starting key and ending key of each task's pending data are recorded to form a partition list BK. Under this mechanism, any key not in BK can find the partition it will be assigned to by comparison against the boundaries, so only the detailed distribution strategy of the keys in BK needs to be recorded. After BK is determined, the task scheduling module is invoked to obtain a scheduling result in preparation for the next step of data distribution.
3) Data redistribution. This part formulates in detail the distribution strategy for the portion of each key in BK within each Map task. For the to-be-allocated portion of a key in BK, following the data locality principle, a target Reduce task running on the same server as the Map task is found first and allocated to, and the remaining portion is then allocated according to the first-fit algorithm. At each allocation, the present invention splits the data according to a configured time upper bound t_max and the performance prediction module, ensuring that no Reduce task takes longer than t_max. If all Reduce tasks have been allocated and data still remain, t_max is increased and the distribution restarts, iterating in this way until all keys in BK have been allocated.
The detailed steps are as follows:
1. Preparation
The parameters of the performance prediction module must be obtained first.
1) Run benchmarks to obtain the data processing performance h of each server in the computing cluster.
2) Vary the number of tasks running simultaneously on a server to simulate computational interference of different degrees, obtain multiple groups of h, and fit the parameters c0 and c1 of the interference model f = c0 + c1 * S/D using an algorithm such as least squares.
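As an illustrative sketch (not from the patent), fitting c0 and c1 is an ordinary linear least-squares problem in the single regressor S/D; the function name and inputs here are assumptions:

```python
import numpy as np

def fit_interference(ratios, f_values):
    """Fit f = c0 + c1 * (S/D) by ordinary least squares.

    ratios   : measured S/D values (CPU supply / demand) per trial
    f_values : observed interference coefficients per trial
    Returns (c0, c1).
    """
    A = np.column_stack([np.ones(len(ratios)), np.asarray(ratios, float)])
    (c0, c1), *_ = np.linalg.lstsq(A, np.asarray(f_values, float), rcond=None)
    return float(c0), float(c1)
```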
3) Test and obtain the data transfer speed b_s between servers in the computing cluster.
2. sampling of data simultaneously determines partition list
Before the Map tasks start, sample the pending data, count the distribution of the keys, and use it to estimate the distribution {u_k} of the intermediate results that the Map tasks will output, where u_k is the expected quantity of data records whose key is k; at the same time, mark the detailed distribution of the data within each Map task to be executed.
After the statistics are complete, as shown in Fig. 2, the elements of {u_k} are sorted by key, and the data are then allocated sequentially to the Reduce tasks, ensuring that each task's computation amount is identical (the computation amount normally equals the data volume), yielding the list BK that records the keys on task partition boundaries. Keys such as k2 and k4, whose data volumes are large, must be split and distributed to multiple adjacent Reduce tasks to avoid the data skew problem.
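A minimal sketch (illustrative, not the patent's code) of building the boundary list BK: sort the keys, walk them while filling equal-sized partitions, and record each key whose run of records straddles a partition boundary:

```python
def build_bk(key_counts, num_reducers):
    """Return (boundary key list BK, per-reducer quota).

    key_counts : dict key -> estimated record count u_k
    A key whose records start and end in different partitions is a
    boundary key and goes into BK; its records must later be split
    across adjacent Reduce tasks.
    """
    total = sum(key_counts.values())
    quota = total / num_reducers          # equal share per Reduce task
    bk, filled = [], 0.0
    for key in sorted(key_counts):
        before = filled
        filled += key_counts[key]
        # the key straddles a boundary if it begins and ends in
        # different quota-sized buckets
        if int(before // quota) != int((filled - 1e-9) // quota):
            bk.append(key)
    return bk, quota
```

With the counts {k1: 1, k2: 5, k3: 1, k4: 5} and three reducers, k2 and k4 cross partition boundaries and end up in BK, matching the example in Fig. 2.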
3. data are just distributed
By comparing with the boundary keys in BK, the unique target Reduce task of any key not in BK can be determined; for example, it is easy to obtain that k2 can only be distributed to Reduce2. After the data u_k are assigned to their corresponding Reduce tasks, the expected time of each Reduce task after the initial distribution can be computed from the performance prediction module, and at the same time the remote data read volume s_s of each task is counted, in preparation for the next step of task scheduling.
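The boundary comparison above amounts to a binary search over the sorted BK list; a small illustrative sketch (function and variable names are assumptions):

```python
import bisect

def target_reducer(key, bk):
    """Map a key not in BK to its unique Reduce task index.

    bk is the sorted list of boundary keys; keys strictly between
    bk[i-1] and bk[i] belong to Reduce task i. Boundary keys
    themselves are split separately and must not use this lookup.
    """
    assert key not in bk, "boundary keys have their own split strategy"
    return bisect.bisect_left(bk, key)

# e.g. with BK = ['k2', 'k4']: keys below 'k2' -> task 0,
# keys between 'k2' and 'k4' -> task 1, keys above 'k4' -> task 2
```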
4. task schedule
Map tasks are scheduled first, then Reduce tasks.
1) Data locality. As shown in Fig. 3, the input of the first round of Map tasks comes from the HDFS distributed file system, where files are stored in the form of multiple file blocks. Each Map task processes one block, and each block has identical backups on multiple servers; running a Map task on a server holding a corresponding backup achieves the highest level of data locality. When a Map task completes, its output intermediate data will be sent to each Reduce task; the server that minimizes the amount of data a Reduce task must read remotely achieves the maximum data locality for that task. The present invention determines the preferred node of each Reduce task from the statistics of the partial data distribution after the initial data distribution.
2) Task threshold. According to the ratio of the performance h of the available servers, compute the number of tasks that should be assigned to each server; that is, more tasks are placed on better-performing servers.
3) Task dispatch. As shown in Fig. 3, each task i is first preferentially assigned to run on the server that gives it the maximum data locality; if the preferred node has already been assigned a number of tasks reaching the threshold, task i waits for the next scheduling round (e.g., Task3). Each task i remaining after the first round is assigned, according to the first-fit algorithm, to run on the first server found that has not yet reached its threshold.
After task scheduling completes, the interference coefficient f of each server is known, as is the remote data read volume s_s of each task after the initial data distribution.
5. data are distributed again
The process is shown in Fig. 4.
1) For any key k in BK, find the set of Map tasks containing it.
2) Traverse each Map task in the set; if there is a target Reduce task i that satisfies the highest data locality, i.e., it runs on the same server, then the intermediate results generated by this Map task are preferentially distributed to i. According to the performance prediction module and the preset t_max, compute the maximum amount of data currently assignable to task i, and then split and distribute the Map task's data accordingly.
3) Traverse the Map tasks in the set whose data are not yet fully allocated, and attempt in turn to distribute data to all non-local target Reduce tasks, splitting the data by the principle that each task's scheduled time must not exceed t_max.
4) If data of k still remain after the two distribution passes, the current t_max is too small; reset the data, increase t_max, and restart the data distribution from step 1, iterating until all keys k are fully distributed.
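The iterative loop above can be sketched as follows. This is an illustrative simplification: it folds the local and non-local passes into a single first-fit pass, and assumes a `capacity_at` callback standing in for the performance prediction module:

```python
def redistribute(remaining, reducers, capacity_at, t_max, growth=1.5):
    """Split leftover boundary-key data under a per-task time budget.

    remaining   : dict key -> leftover data volume for keys in BK
    reducers    : list of Reduce task ids
    capacity_at : fn (reducer, t_max) -> extra volume it can absorb
                  without exceeding t_max (stands in for the
                  performance prediction module)
    Grows t_max and retries whenever the current budget cannot absorb
    everything, mirroring step 4.
    """
    while True:
        plan = {r: {} for r in reducers}
        left = dict(remaining)            # reset the data each attempt
        room = {r: capacity_at(r, t_max) for r in reducers}
        for k in list(left):
            for r in reducers:            # first fit over Reduce tasks
                if left[k] <= 0:
                    break
                take = min(left[k], room[r])
                if take > 0:
                    plan[r][k] = plan[r].get(k, 0) + take
                    room[r] -= take
                    left[k] -= take
        if all(v <= 0 for v in left.values()):
            return plan, t_max
        t_max *= growth                   # budget too small: retry larger
```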
It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the concept of the invention, and all of these fall within the scope of protection of the present invention. Therefore, the scope of patent protection of the present invention shall be determined by the appended claims.

Claims (2)

1. A data processing acceleration method for the Spark system, composed of three parts: a performance prediction module, a task scheduling module, and a data distribution module, characterized in that:
the performance prediction module models the performance of a task according to given parameters and predicts its completion time;
the task scheduling module distributes computing tasks to servers for execution, i.e., it determines the task scheduling strategy according to hardware heterogeneity and data locality, obtains the currently available computing resources in real time through a resource monitor, and then dispatches tasks to execute on designated servers through the developed program interface;
the data distribution module is the process of distributing, according to certain rules, the intermediate data in (key, value) key-value pair format generated by tasks to different Reduce tasks for processing; wherein:
1) the data distribution module samples the data based on a binomial distribution and records the estimated distribution of each key in each Map task; by sampling and analyzing the data, the overall distribution can be estimated at a small cost, and a data distribution strategy can then be formulated;
2) the data distribution module distributes the data in the form of a recorded partition list;
2-1, the data distribution obtained from the sampling analysis is sorted by key, the keys are then evenly distributed to each task, and the starting key and ending key of the data to be processed by each task are recorded to form a partition list BK;
2-2, a key not belonging to BK can find the partition it will be assigned to by comparison, so only the detailed distribution policy of the keys in BK needs to be recorded;
2-3, after BK is determined, the scheduling result is obtained using the task scheduling module, in preparation for the next step of data distribution;
3) the data distribution module redistributes the data;
3-1, according to the data locality principle, for each key in BK to be allocated, the target Reduce task running on the same server as its Map task is found first and allocated that portion;
3-2, the remaining portion is then allocated according to the first-fit algorithm; at each allocation, the data distribution module performs data cutting according to the set time upper limit t_max and the performance prediction module, ensuring that the time taken by each Reduce task does not exceed t_max;
3-3, if all Reduce tasks have been assigned but data still remains, t_max is increased and distribution restarts; this iterates until all keys in BK are assigned.
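The partition-list construction of step 2) and the iterative redistribution of step 3) can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names, the even split of sorted keys by count, and the use of a caller-supplied `predict_time` function standing in for the performance prediction module are all assumptions made for the example.

```python
def build_partition_list(sampled_counts, num_reducers):
    """Step 2-1 (sketch): sort the sampled keys and split them into
    contiguous ranges, recording each range's starting and ending key
    to form the partition list BK."""
    keys = sorted(sampled_counts)
    per = -(-len(keys) // num_reducers)  # ceiling division
    return [(keys[i], keys[min(i + per, len(keys)) - 1])
            for i in range(0, len(keys), per)]

def redistribute(bk_loads, num_reducers, predict_time, t0, grow=1.2):
    """Steps 3-2 and 3-3 (sketch): pack the BK key-range loads into a
    fixed number of Reduce tasks by first fit, keeping each task's
    predicted time under t_max; if data is left over after all tasks
    are full, enlarge t_max and restart, iterating until every load
    is assigned."""
    t_max = t0
    while True:
        tasks = [[] for _ in range(num_reducers)]
        leftover = []
        for load in bk_loads:
            for t in tasks:  # first fit: first task that still has room
                if predict_time(t + [load]) <= t_max:
                    t.append(load)
                    break
            else:
                leftover.append(load)
        if not leftover:
            return tasks, t_max
        t_max *= grow  # step 3-3: raise the limit and redistribute
```

For example, with per-range record counts as loads and a trivial linear cost model (`predict_time = sum`), `redistribute([5, 3, 8, 2, 6], 3, sum, 8)` packs the ranges into three Reduce tasks, none predicted to exceed the limit of 8.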
2. The data processing acceleration method for the Spark system according to claim 1, characterized in that the task scheduling module schedules Map tasks first and then schedules Reduce tasks, comprising the following steps:
1) an upper limit on the number of tasks is determined for each server; the upper limit is jointly determined by the server's own performance indicator h and the total number of tasks, and the task quantities of the servers are in the ratio of their h values;
2) the target server of each task is determined; scheduling first follows the data locality principle, preferentially scheduling a task to the server holding its data to be processed; if the target server's pre-allocated tasks are found to have reached the upper limit, the task waits for the next round of scheduling; the second round of scheduling uses the first-fit algorithm, assigning not-yet-scheduled tasks to servers whose upper limit has not yet been reached;
3) computing resource units are allocated; after the target server of a task is determined, a specific computing resource unit needs to be allocated to the task.
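The per-server limits and the two-round scheduling of steps 1) and 2) above can be sketched as follows. The server names, the performance indicator values h, and the data layout are assumptions for the example; real placement in the claimed method would go through the resource monitor and program interface rather than a plain dictionary.

```python
import math

def task_limits(h, total_tasks):
    """Step 1) (sketch): per-server task upper limits, proportional to
    each server's performance indicator h."""
    s = sum(h.values())
    return {srv: math.ceil(total_tasks * v / s) for srv, v in h.items()}

def schedule(tasks, data_location, limits):
    """Step 2) (sketch): round 1 places each task on the server holding
    its data if that server is below its limit; tasks that cannot be
    placed wait for round 2, which assigns them first-fit to any server
    with remaining capacity."""
    placement = {}
    used = {srv: 0 for srv in limits}
    waiting = []
    for t in tasks:  # round 1: data locality
        srv = data_location[t]
        if used[srv] < limits[srv]:
            placement[t] = srv
            used[srv] += 1
        else:
            waiting.append(t)
    for t in waiting:  # round 2: first fit on servers under their limit
        for srv in limits:
            if used[srv] < limits[srv]:
                placement[t] = srv
                used[srv] += 1
                break
    return placement
```

With two servers where s1 is twice as fast as s2 (h = {'s1': 2, 's2': 1}) and three tasks whose data all sits on s1, the first two tasks stay local on s1; the third exceeds s1's limit and spills to s2 in the first-fit round.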
CN201910467553.0A 2019-05-31 2019-05-31 A data processing acceleration method for the Spark system Pending CN110262896A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910467553.0A CN110262896A (en) A data processing acceleration method for the Spark system


Publications (1)

Publication Number Publication Date
CN110262896A true CN110262896A (en) 2019-09-20

Family

ID=67916087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910467553.0A Pending CN110262896A (en) A data processing acceleration method for the Spark system

Country Status (1)

Country Link
CN (1) CN110262896A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094155A (en) * 2019-12-23 2021-07-09 中国移动通信集团辽宁有限公司 Task scheduling method and device under Hadoop platform
CN114385336A (en) * 2021-12-27 2022-04-22 同济大学 Anti-interference scheduling method and device for flow big data processing task
CN114880272A (en) * 2022-03-31 2022-08-09 深圳清华大学研究院 Optimization method and application of global height degree vertex set communication
CN114880272B (en) * 2022-03-31 2024-06-07 深圳清华大学研究院 Optimization method and application of global height degree vertex set communication

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808334A (en) * 2016-03-04 2016-07-27 山东大学 MapReduce short job optimization system and method based on resource reuse
CN108572873A (en) * 2018-04-24 2018-09-25 中国科学院重庆绿色智能技术研究院 A kind of load-balancing method and device solving the problems, such as Spark data skews
CN109376012A (en) * 2018-10-10 2019-02-22 电子科技大学 A kind of self-adapting task scheduling method based on Spark for isomerous environment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Qiaoqiao (李巧巧): "Research on Spark Task Partitioning and Scheduling Strategies for Load Balancing", Information Science and Technology Series (《信息科技辑》) *


Similar Documents

Publication Publication Date Title
CN107888669B (en) Deep learning neural network-based large-scale resource scheduling system and method
CN107038069B (en) Dynamic label matching DLMS scheduling method under Hadoop platform
Chang et al. Scheduling in mapreduce-like systems for fast completion time
US9471390B2 (en) Scheduling mapreduce jobs in a cluster of dynamically available servers
US20200219028A1 (en) Systems, methods, and media for distributing database queries across a metered virtual network
CN111381950A (en) Task scheduling method and system based on multiple copies for edge computing environment
CN110737529A (en) cluster scheduling adaptive configuration method for short-time multiple variable-size data jobs
CN103729246B (en) Method and device for dispatching tasks
CN107357652B (en) Cloud computing task scheduling method based on segmentation ordering and standard deviation adjustment factor
CN104657220A (en) Model and method for scheduling for mixed cloud based on deadline and cost constraints
CN108469988A (en) A kind of method for scheduling task based on isomery Hadoop clusters
CN103699433B (en) One kind dynamically adjusts number of tasks purpose method and system in Hadoop platform
CN103401939A (en) Load balancing method adopting mixing scheduling strategy
CN112685153A (en) Micro-service scheduling method and device and electronic equipment
CN108845874A (en) The dynamic allocation method and server of resource
CN102207883A (en) Transaction scheduling method of heterogeneous distributed real-time system
US20130268941A1 (en) Determining an allocation of resources to assign to jobs of a program
Pongsakorn et al. Container rebalancing: Towards proactive linux containers placement optimization in a data center
CN107766140B (en) Schedulability analysis method for real-time task with preemption point
US6278963B1 (en) System architecture for distribution of discrete-event simulations
CN106991006A (en) Support the cloud workflow task clustering method relied on and the time balances
CN112015765B (en) Spark cache elimination method and system based on cache value
CN109710372A (en) A kind of computation-intensive cloud workflow schedule method based on cat owl searching algorithm
CN110262896A (en) A kind of data processing accelerated method towards Spark system
Hu et al. Minimizing resource consumption cost of DAG applications with reliability requirement on heterogeneous processor systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190920