CN109144707A - A method for handling data distribution imbalance on the big data platform Spark - Google Patents

A method for handling data distribution imbalance on the big data platform Spark

Info

Publication number
CN109144707A
CN109144707A (application CN201710456187.XA)
Authority
CN
China
Prior art keywords
task
data block
remaining runtime
data
remaining
Prior art date
Legal status
Pending
Application number
CN201710456187.XA
Other languages
Chinese (zh)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee
Chengdu Zhongke Cluster Information Technology Co., Ltd.
Original Assignee
He Majun
Huang Chaojie
Ren Xiaoqin
Priority date
Filing date
Publication date
Application filed by He Majun, Huang Chaojie, Ren Xiaoqin filed Critical He Majun
Priority to CN201710456187.XA priority Critical patent/CN109144707A/en
Publication of CN109144707A publication Critical patent/CN109144707A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Abstract

An embodiment of the invention discloses a method for handling data distribution imbalance on the big data platform Spark, relating to the fields of cluster resource scheduling and load balancing. Targeting the data skew problem that arises among the tasks of each Spark stage, the invention proposes the MRFair solution. The method comprises the steps of: (1) designating some tasks on each Spark stage as active tasks; (2) estimating the remaining runtime of each task from the metadata of the active tasks; (3) sorting the active tasks in descending order of remaining runtime and reassigning an available data block of the task with the largest remaining runtime to the task with the smallest remaining runtime; (4) updating the tasks' remaining runtimes and repeating step (3) until no available data block can be scheduled, or until the remaining runtime of the reallocated task is smaller than the maximum remaining runtime of the other tasks. The invention can effectively reduce the total completion time of Spark jobs and improve their quality of service.

Description

A method for handling data distribution imbalance on the big data platform Spark
Technical field
The present invention relates to the field of online cluster resource scheduling, and in particular to a method for handling data distribution imbalance on the big data platform Spark.
Background technique
Spark is an in-memory computing framework for processing massive data in a distributed manner that is reliable, efficient, and scalable. The main components deployed in a Spark cluster are the Spark Client, SparkContext, ClusterManager, Worker, and Executor, as shown in Fig. 1. The Spark Client is used by the user to submit an application to the Spark cluster. The SparkContext is responsible for communicating with the ClusterManager, requesting resources, distributing and monitoring tasks, and managing the life cycle of job execution. The ClusterManager provides resource allocation and management; this role is played by different components under different run modes: by the Master under the Local and Spark Standalone modes, by the ResourceManager under the YARN mode, and by the Mesos Manager under the Mesos mode. After the SparkContext has divided a job's operations and allocated resources, tasks are sent to the Executors on the Worker nodes for execution.
Spark's programming model is the Resilient Distributed Dataset (RDD), an extension of the MapReduce model that enables efficient data sharing between stages of a parallel computation. A Spark application is essentially a sequence of operations on RDDs. These operators fall into transformations and actions: transformations are lazily evaluated, and only an action triggers job submission.
The core of Spark job scheduling is its two schedulers, the DAGScheduler and the TaskScheduler. The DAGScheduler handles the logical scheduling of tasks: it is the high-level, stage-oriented scheduler that splits a job into TaskSets, one per stage, whose tasks have no dependencies on one another, while the TaskScheduler handles the concrete scheduling and execution of tasks. A Spark application builds a directed acyclic graph (DAG) from the dependencies among RDDs; the DAGScheduler parses the DAG and splits it into interdependent scheduling stages. Each stage comprises one or more tasks, which form a TaskSet that the DAGScheduler submits to the lower-level TaskScheduler for execution. The DAGScheduler also monitors the progress of each stage; if a stage fails, it must be resubmitted.
The TaskScheduler receives TaskSets from the DAGScheduler and distributes their tasks one by one to the Executors on the Worker nodes of the Spark cluster for execution. If a task fails, the TaskScheduler is responsible for retrying it. If the TaskScheduler finds that a task has been running for a long time without completing, it may launch a speculative copy of the same task; whichever copy finishes first has its result used. After an Executor on a Worker node receives tasks from the TaskScheduler, it runs them in a multithreaded fashion, one thread per task, and returns the results to the TaskScheduler when the tasks complete.
Summary of the invention
The technical problem addressed by the present invention is to provide a method for handling data distribution imbalance on the big data platform Spark that reduces the degree of data skew among tasks and speeds up task execution.
To solve the above technical problem, in a first aspect, an embodiment of the invention provides a method for handling data distribution imbalance on the big data platform Spark, the method comprising the following four steps:
(1) designate some tasks on each Spark stage as active tasks;
(2) estimate the remaining runtime of each task from the metadata of the active tasks;
(3) sort the active tasks in descending order of remaining runtime, and reassign an available data block of the task with the largest remaining runtime to the task with the smallest remaining runtime;
(4) update the tasks' remaining runtimes, and repeat step (3) until no available data block can be scheduled, or until the remaining runtime of the reallocated task is smaller than the maximum remaining runtime of the other tasks.
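Steps (1)-(4) above describe a greedy rebalancing loop. A minimal Python sketch follows; the data structures are hypothetical (the patent specifies no concrete representation), and each movable block is modeled simply by its estimated runtime cost:

```python
def rebalance(remaining, movable_blocks):
    """Greedy loop over steps (3)-(4): repeatedly move one available data
    block from the task with the largest remaining runtime to the task
    with the smallest, then update the estimates, until no block is left
    or the reallocated task no longer dominates the others."""
    while True:
        longest = max(remaining, key=remaining.get)
        shortest = min(remaining, key=remaining.get)
        blocks = movable_blocks.get(longest, [])
        if not blocks or longest == shortest:
            break                       # no available data block can be scheduled
        cost = blocks.pop()             # estimated runtime of the moved block
        remaining[longest] -= cost      # step (4): update remaining runtimes
        remaining[shortest] += cost
        if remaining[longest] < max(t for k, t in remaining.items()
                                    if k != longest):
            break   # reallocated task is below the others' maximum remaining time
    return remaining

# Hypothetical remaining runtimes (seconds); T1 holds two movable blocks of cost 4.
print(rebalance({"T1": 10, "T2": 2, "T3": 3}, {"T1": [4, 4]}))
```

With these numbers the loop moves one block of T1 to T2 and one to T3, after which T1 no longer has the largest remaining runtime and the loop stops.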
According to the first aspect, in a first possible implementation, some tasks on each Spark scheduling stage are designated as active tasks. The number of active tasks is m, with m ≤ min(s, n), where s is the total number of CPU cores in the cluster and n is the number of RDD partitions or the total number of tasks in the stage; that is, the number of active tasks does not exceed the minimum of the cluster's CPU core count and the RDD partition count or stage task count. An active task cannot stop on its own as its execution nears the end; it stops only upon receiving a stop-command message. This ensures that the unprocessed data blocks of the longest-running task can be assigned to other tasks for execution.
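The bound m ≤ min(s, n) is straightforward to state in code; a sketch (the function name is hypothetical):

```python
def active_task_count(total_cpu_cores: int, stage_task_count: int) -> int:
    """Upper bound on the number of active tasks, m <= min(s, n), where s is
    the cluster's total CPU core count and n is the RDD partition count
    (equivalently, the stage's total task count)."""
    return min(total_cpu_cores, stage_task_count)

# A 16-core cluster running a stage of 12 partitions activates at most 12 tasks.
print(active_task_count(16, 12))  # -> 12
```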
According to the first aspect, in a second possible implementation, at most one task is selected for reallocation at a time, because splitting the data blocks of one task and handing them to other tasks occupies the idle CPU resources released by another task that has already completed; moreover, reallocating two tasks at once is no more effective than reallocating a single one, and reallocating one task lets that task make full use of the remaining resources. The MRFair Master estimates the remaining execution time of all tasks and, upon detection, selects the task with the longest remaining time for reallocation. The Master must ensure that the remaining time is sufficient to compute a reallocation plan. If MRFair decides to reallocate the available data blocks of a task T, the following two conditions must hold:
(1) the system currently has at least one idle CPU core;
(2) reallocation is worthwhile only if task T's original execution time exceeds the post-reallocation execution time plus the extra overhead incurred by the reallocation.
For long-running tasks the impact of data skew is severe, and the extra overhead of reallocating a task is essentially negligible.
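The two preconditions above can be folded into a single predicate; a minimal sketch, assuming all times are expressed in the same unit:

```python
def should_reallocate(idle_cores: int, original_time: float,
                      reallocated_time: float, overhead: float) -> bool:
    """Conditions (1) and (2) for reallocating the available blocks of a task T:
    (1) at least one CPU core is currently idle;
    (2) the original execution time exceeds the post-reallocation execution
        time plus the extra overhead of the reallocation itself."""
    return idle_cores >= 1 and original_time > reallocated_time + overhead

# A skewed 100 s task is worth splitting if it would finish in 55 s after
# reallocation with 2 s of overhead, provided a core is free.
print(should_reallocate(1, 100.0, 55.0, 2.0))  # -> True
```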
According to the first aspect, in a third possible implementation, after a user submits a job, as long as unscheduled tasks exist MRFair invokes the traditional Schedule Backend module to schedule and distribute them. Once the Master node has scheduled and distributed all tasks, MRFair is activated by the detection module in the MRFair Worker, which performs skew detection on the running tasks based on their estimated remaining runtimes, as shown in Fig. 2. If the remaining runtimes of tasks differ significantly (so that splitting a task's remaining work is worthwhile), this information is reported to the MRFair Master, and the task ID together with its estimated remaining runtime is stored in a hash list on the MRFair Master node to await further processing.
According to the first aspect, in a fourth possible implementation, when the Master node sorts the active tasks in descending order of remaining runtime, the task T1 with the longest remaining runtime must be suspended: the MRFair Master notifies the MRFair Worker to pause the task's execution and to capture the position of the last input data it processed so that previously processed input can be skipped. If T1 is in a state in which it cannot, or can hardly, be stopped, the request fails, and the Master node either reselects the task T2 with the second-longest remaining runtime and applies the same processing to it, or, if T1 is the last task in the job, repartitions T1's entire input and re-executes it, just like Spark's speculative execution strategy.
According to the first aspect, in a fifth possible implementation, data blocks rather than bytes are chosen as the unit of data reallocation, because scanning and splitting a task's input byte by byte would block the Executor for a long time, which cannot be tolerated. To maintain block metadata at low cost, MRFair tries to schedule data blocks fairly even though the blocks are of unequal size. To this end, the Master collects the metadata of all tasks and their data blocks, stores it locally, and then uses a runtime-based strategy to decide which task is scheduled first. A data block is defined to be in one of the following five states:
(1) LocalFetched: the data block resides on the local node;
(2) RemoteFetchWaiting: the data block resides on a remote node, and the request to fetch it has not yet been sent;
(3) RemoteFetching: the data block resides on a remote node, and the request to fetch it has been issued;
(4) RemoteFetched: the data block resides on a remote node and has already been fetched to the local node;
(5) Used: the data block has been consumed.
According to the first aspect, in a sixth possible implementation, estimating a task's remaining runtime is a key factor in deciding how to reallocate available data blocks. If a block's state is LocalFetched or RemoteFetched, its remaining runtime is the block size divided by the block's computation speed on the local Executor. If a block's state is RemoteFetchWaiting or RemoteFetching, then, if the block is already on the local Executor, its remaining runtime is the block size divided by its computation speed on the local Executor; otherwise its remaining runtime is the block size divided by its computation speed on the local Executor, plus the block size divided by the download speed from the remote Executor to the local Executor. A block in any other state has a remaining runtime of 0. From the metadata of the blocks a task owns, the remaining runtime of the whole task can be derived.
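The per-state estimation rule above can be sketched as follows. The `Block` record and the two speed parameters are hypothetical simplifications of the Executor statistics the patent relies on:

```python
from dataclasses import dataclass

@dataclass
class Block:
    size: float          # bytes
    state: str           # one of the five states defined above
    local: bool = True   # for Remote* states: already on the local Executor?

def block_remaining_time(b: Block, compute_speed: float,
                         download_speed: float) -> float:
    """Remaining runtime of one block, per the per-state rule."""
    if b.state in ("LocalFetched", "RemoteFetched"):
        return b.size / compute_speed
    if b.state in ("RemoteFetchWaiting", "RemoteFetching"):
        t = b.size / compute_speed
        if not b.local:                  # must still be downloaded
            t += b.size / download_speed
        return t
    return 0.0                           # e.g. Used blocks cost nothing

def task_remaining_time(blocks, compute_speed, download_speed):
    """Sum the block estimates to obtain the whole task's remaining runtime."""
    return sum(block_remaining_time(b, compute_speed, download_speed)
               for b in blocks)

blocks = [Block(100.0, "LocalFetched"),
          Block(200.0, "RemoteFetching", local=False),
          Block(50.0, "Used")]
print(task_remaining_time(blocks, compute_speed=10.0, download_speed=20.0))  # -> 40.0
```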
According to the first aspect, in a seventh possible implementation, the task with the longest remaining runtime is reallocated first. The reallocation algorithm takes as input the metadata tasks of all active tasks and a remaining-runtime predictor θ. The predictor θ estimates from statistical data the processing time of a data block on a specific Executor. Suppose there are m active tasks, with metadata tasks = {task1, task2, ..., taskm}. The reallocation algorithm first sorts the active tasks in descending order of remaining runtime, task_f > task_i > ... > task_t (f, i, ..., t ∈ [1, m]), then selects the task task_f with the longest remaining runtime, filters all of its unprocessed data blocks, and saves the result into a list blocks; if there are n available blocks, blocks = {block1, block2, ..., blockn}. The algorithm traverses all available data blocks, handing each block to the earliest-finishing task task_t, then updates the hash table of task remaining runtimes, re-sorts the tasks list, and removes the scheduled block from the blocks list. This process repeats until one of the following two conditions is met:
(1) blocks = Ø;
(2) RT(task_f) < max{ RT(task_k) : k ≠ f }, where RT denotes the estimated remaining runtime.
That is, either no available data block remains to be scheduled, or the remaining runtime of the reallocated task is smaller than the maximum remaining runtime of the other tasks.
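The two termination conditions can be checked together; a sketch, assuming the RT values have already been computed by the predictor θ:

```python
def should_stop(blocks, remaining, task_f):
    """Termination test for the reallocation loop:
    (1) blocks is empty, or
    (2) RT(task_f) < max{ RT(task_k) : k != f }."""
    if not blocks:
        return True
    others_max = max(t for k, t in remaining.items() if k != task_f)
    return remaining[task_f] < others_max

print(should_stop([], {"T1": 5.0, "T2": 3.0}, "T1"))      # -> True  (condition 1)
print(should_stop(["b1"], {"T1": 2.0, "T2": 3.0}, "T1"))  # -> True  (condition 2)
print(should_stop(["b1"], {"T1": 5.0, "T2": 3.0}, "T1"))  # -> False (keep going)
```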
In a second aspect, an embodiment of the invention provides an apparatus for handling data distribution imbalance of the big data platform Spark, comprising a scheduling apparatus implementing the method of the first aspect or any possible implementation of the first aspect.
In a third aspect, an embodiment of the invention provides a power-consumption reduction method for handling data distribution imbalance of the big data platform Spark, characterized in that a Spark cluster system performs scheduling using the method described in the first aspect or any possible implementation of the first aspect.
Brief description of the drawings
Fig. 1 is the system architecture diagram of a Spark cluster according to an embodiment of the invention;
Fig. 2 is the execution flowchart of skew-task detection in the MRFair system according to an embodiment of the invention;
Fig. 3 is the flowchart of the MRFair balanced data-assignment method according to an embodiment of the invention;
Fig. 4 is an illustrative example according to an embodiment of the invention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the invention and are not intended to limit its scope.
As shown in Fig. 3, an embodiment of the invention provides a method for handling data distribution imbalance on the big data platform Spark, the method comprising the steps:
S101. Designate some tasks on each Spark stage as active tasks.
S102. Estimate the remaining runtime of each task from the metadata of the active tasks.
S103. Sort the active tasks in descending order of remaining runtime, and reassign an available data block of the task with the largest remaining runtime to the task with the smallest remaining runtime.
S104. Update the tasks' remaining runtimes, and repeat step S103 until no available data block can be scheduled, or until the remaining runtime of the reallocated task is smaller than the maximum remaining runtime of the other tasks.
Those skilled in the art will understand that the numbering of the steps in the methods of the various embodiments does not imply an order of execution; the execution order of the steps is determined by their function and internal logic, and does not limit the implementation of the specific embodiments of the invention.
An embodiment of the invention further provides a Spark cluster system comprising the scheduling apparatus of the embodiments shown in Fig. 3; the cluster system may be deployed according to the architecture shown in Fig. 1, and the scheduling apparatus may be the task scheduler shown in Fig. 1.
The embodiments of the invention are further illustrated below with a concrete example:
The illustrative example is detailed in Fig. 4 and uses the default Spark Standalone run mode. The same Spark stage contains four tasks T1, T2, T3, T4, and the current cluster environment has four available resources Slot1, Slot2, Slot3, Slot4. Initially the active task count is set to 4. At time t1 (i.e., when task T1 completes), the MRFair system estimates the tasks' remaining runtimes, detects that T2 is the task with the longest remaining runtime and suffers severe data skew, and mitigates the slowdown caused by the skew by repartitioning T2's unprocessed input data. In fact, MRFair repartitions T2's unprocessed data not only onto Slot1 and Slot2 but also onto Slot3: when task T3 completes, Slot3 becomes an idle resource, and MRFair again repartitions T2's remaining data, thereby making full use of the resources, accelerating task execution, and reducing the total completion time. The reallocated tasks T2a, T2b, T2c are called the pre-split tasks of the original unprocessed task (pre-split tasks for short) and are scheduled under a longest-processing-time-first policy. MRFair cyclically executes the strategy "estimate the tasks' remaining runtimes, then repartition the unprocessed data blocks of the task with the longest remaining runtime" until all tasks have finished. At time t2, MRFair finds that T4 is the next task with the longest remaining runtime and reduces the impact of data skew by repartitioning T4's remaining unprocessed data.
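The effect in the Fig. 4 scenario can be illustrated numerically. The concrete durations below are invented for this sketch (the patent gives no numbers); they only show how pre-splitting a skewed task onto a freed slot shortens the makespan:

```python
# Hypothetical task durations (time units) on four slots, one task per slot.
durations = {"T1": 2, "T2": 10, "T3": 4, "T4": 6}
makespan_without = max(durations.values())   # skewed T2 dominates: 10

# At t = 2, task T1 finishes and frees its slot; T2's remaining 8 units are
# pre-split across 2 slots, so T2 now ends at 2 + 8/2 = 6.
t_free = durations["T1"]
t2_end = t_free + (durations["T2"] - t_free) / 2
makespan_with = max(t2_end, durations["T3"], durations["T4"])

print(makespan_without, makespan_with)  # -> 10 6.0
```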
Those of ordinary skill in the art will understand that all or part of the processes of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above are only specific embodiments of the invention, but the protection scope of the invention is not limited thereto. Any change or substitution readily conceivable by those familiar with the art within the technical scope disclosed by the invention shall fall within the protection scope of the invention. Therefore, the protection scope of the invention shall be subject to the protection scope of the claims.

Claims (8)

1. A method for handling data distribution imbalance on the big data platform Spark, characterized in that the method comprises the steps:
(1) designate some tasks on each Spark stage as active tasks;
(2) estimate the remaining runtime of each task from the metadata of the active tasks;
(3) sort the active tasks in descending order of remaining runtime, and reassign an available data block of the task with the largest remaining runtime to the task with the smallest remaining runtime;
(4) update the tasks' remaining runtimes, and repeat step (3) until no available data block can be scheduled, or until the remaining runtime of the reallocated task is smaller than the maximum remaining runtime of the other tasks.
2. The method according to claim 1, characterized in that some tasks on each Spark scheduling stage are designated as active tasks; the number of active tasks is m, with m ≤ min(s, n), where s is the total number of CPU cores in the cluster and n is the number of RDD partitions or the total number of tasks in the stage, i.e. the number of active tasks does not exceed the minimum of the cluster's CPU core count and the RDD partition count or stage task count; an active task cannot stop on its own as its execution nears the end, and stops only upon receiving a stop-command message, which ensures that the unprocessed data blocks of the longest-running task are assigned to other tasks for execution.
3. The method according to claim 1, characterized in that MRFair selects at most one task for reallocation at a time, because splitting the data blocks of one task and handing them to other tasks occupies the idle CPU resources released by another task that has already completed, and reallocating two tasks at once is no more effective than reallocating a single one; reallocating one task lets that task make full use of the remaining resources; the MRFair Master estimates the remaining execution time of all tasks and, upon detection, selects the task with the longest remaining time for reallocation; the Master ensures that the remaining time is sufficient to compute a reallocation plan; if MRFair decides to reallocate the available data blocks of a task T, the following two conditions must hold:
(1) the system currently has at least one idle CPU core;
(2) reallocation is worthwhile only if task T's original execution time exceeds the post-reallocation execution time plus the extra overhead incurred by the reallocation;
for long-running tasks the impact of data skew is severe, and the extra overhead of reallocating a task is essentially negligible.
4. The method according to claim 1, characterized in that, after a user submits a job, as long as unscheduled tasks exist MRFair invokes the traditional Schedule Backend module to schedule and distribute them; once the Master node has scheduled and distributed all tasks, MRFair is activated by the detection module in the MRFair Worker, which performs skew detection on the running tasks based on their estimated remaining runtimes, as shown in Fig. 2; if the remaining runtimes of tasks differ significantly (so that splitting a task's remaining work is worthwhile), this information is reported to the MRFair Master, and the task ID together with its estimated remaining runtime is stored in a hash list on the MRFair Master node to await further processing.
5. The method according to claim 1, characterized in that, when the Master node sorts the active tasks in descending order of remaining runtime, the task T1 with the longest remaining runtime must be suspended: the MRFair Master notifies the MRFair Worker to pause the task's execution and to capture the position of the last input data it processed so that previously processed input can be skipped; if T1 is in a state in which it cannot, or can hardly, be stopped, the request fails, and the Master node either reselects the task T2 with the second-longest remaining runtime and applies the same processing to it, or, if T1 is the last task in the job, repartitions T1's entire input and re-executes it, just like Spark's speculative execution strategy.
6. The method according to claim 1, characterized in that data blocks rather than bytes are chosen as the unit of data reallocation, because scanning and splitting a task's input byte by byte would block the Executor for a long time, which cannot be tolerated; to maintain block metadata at low cost, MRFair tries to schedule data blocks fairly even though the blocks are of unequal size; to this end, the Master collects the metadata of all tasks and their data blocks, stores it locally, and then uses a runtime-based strategy to decide which task is scheduled first; a data block is defined to be in one of the following five states:
(1) LocalFetched: the data block resides on the local node;
(2) RemoteFetchWaiting: the data block resides on a remote node, and the request to fetch it has not yet been sent;
(3) RemoteFetching: the data block resides on a remote node, and the request to fetch it has been issued;
(4) RemoteFetched: the data block resides on a remote node and has already been fetched to the local node;
(5) Used: the data block has been consumed.
7. The method according to claim 1, characterized in that estimating a task's remaining runtime is a key factor in deciding how to reallocate available data blocks; if a block's state is LocalFetched or RemoteFetched, its remaining runtime is the block size divided by the block's computation speed on the local Executor; if a block's state is RemoteFetchWaiting or RemoteFetching, then, if the block is already on the local Executor, its remaining runtime is the block size divided by its computation speed on the local Executor, and otherwise its remaining runtime is the block size divided by its computation speed on the local Executor plus the block size divided by the download speed from the remote Executor to the local Executor; a block in any other state has a remaining runtime of 0; from the metadata of the blocks a task owns, the remaining runtime of the whole task can be derived.
8. The method according to claim 1, characterized in that the task with the longest remaining runtime is reallocated first; the reallocation algorithm takes as input the metadata tasks of all active tasks and a remaining-runtime predictor θ; the predictor θ estimates from statistical data the processing time of a data block on a specific Executor; suppose there are m active tasks, with metadata tasks = {task1, task2, ..., taskm}; the reallocation algorithm first sorts the active tasks in descending order of remaining runtime, task_f > task_i > ... > task_t (f, i, ..., t ∈ [1, m]), then selects the task task_f with the longest remaining runtime, filters all of its unprocessed data blocks, and saves the result into a list blocks; if there are n available blocks, blocks = {block1, block2, ..., blockn}; the algorithm traverses all available data blocks, handing each block to the earliest-finishing task task_t, then updates the hash table of task remaining runtimes, re-sorts the tasks list, and removes the scheduled block from the blocks list; this process repeats until one of the following two conditions is met:
(1) blocks = Ø;
(2) RT(task_f) < max{ RT(task_k) : k ≠ f }, where RT denotes the estimated remaining runtime;
that is, either no available data block remains to be scheduled, or the remaining runtime of the reallocated task is smaller than the maximum remaining runtime of the other tasks.
CN201710456187.XA 2017-06-16 2017-06-16 A method for handling data distribution imbalance on the big data platform Spark Pending CN109144707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710456187.XA CN109144707A (en) 2017-06-16 2017-06-16 A method for handling data distribution imbalance on the big data platform Spark


Publications (1)

Publication Number Publication Date
CN109144707A true CN109144707A (en) 2019-01-04

Family

ID=64830230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710456187.XA Pending CN109144707A (en) 2017-06-16 2017-06-16 A kind of unbalanced method of processing big data platform Spark data distribution

Country Status (1)

Country Link
CN (1) CN109144707A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110442594A (en) * 2019-07-18 2019-11-12 华东师范大学 A kind of Dynamic Execution method towards Spark SQL Aggregation Operators
CN111061565A (en) * 2019-12-12 2020-04-24 湖南大学 Two-stage pipeline task scheduling method and system in Spark environment
CN111445213A (en) * 2020-03-31 2020-07-24 乌鲁木齐众维汇联信息科技有限公司 Network management system for incubation service of park enterprise
CN112214291A (en) * 2019-07-12 2021-01-12 杭州海康汽车技术有限公司 Task scheduling method and device
CN113626207A (en) * 2021-10-12 2021-11-09 苍穹数码技术股份有限公司 Map data processing method, device, equipment and storage medium
CN114162551A (en) * 2020-09-11 2022-03-11 同方威视技术股份有限公司 Graph judging task control method for security inspection system and security inspection system
CN117115825A (en) * 2023-10-23 2023-11-24 深圳市上融科技有限公司 Method for improving license OCR recognition rate

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106681823A (en) * 2015-11-05 2017-05-17 田文洪 Load balancing method for processing MapReduce data skew
CN106682116A (en) * 2016-12-08 2017-05-17 重庆邮电大学 OPTICS point sorting clustering method based on Spark memory computing big data platform
US20170169800A1 (en) * 2015-09-03 2017-06-15 Synthro Inc. Systems and techniques for aggregation, display, and sharing of data


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIADONG YU: "SASM: Improving Spark performance with adaptive skew mitigation", 2015 IEEE International Conference on Progress in Informatics and Computing *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112214291A (en) * 2019-07-12 2021-01-12 杭州海康汽车技术有限公司 Task scheduling method and device
CN110442594A (en) * 2019-07-18 2019-11-12 华东师范大学 A kind of Dynamic Execution method towards Spark SQL Aggregation Operators
CN111061565A (en) * 2019-12-12 2020-04-24 湖南大学 Two-stage pipeline task scheduling method and system in Spark environment
CN111061565B (en) * 2019-12-12 2023-08-25 湖南大学 Two-section pipeline task scheduling method and system in Spark environment
CN111445213A (en) * 2020-03-31 2020-07-24 乌鲁木齐众维汇联信息科技有限公司 Network management system for incubation service of park enterprise
CN114162551A (en) * 2020-09-11 2022-03-11 同方威视技术股份有限公司 Graph judging task control method for security inspection system and security inspection system
CN114162551B (en) * 2020-09-11 2023-02-24 同方威视技术股份有限公司 Graph judging task control method for security inspection system and security inspection system
CN113626207A (en) * 2021-10-12 2021-11-09 苍穹数码技术股份有限公司 Map data processing method, device, equipment and storage medium
CN117115825A (en) * 2023-10-23 2023-11-24 深圳市上融科技有限公司 Method for improving license OCR recognition rate
CN117115825B (en) * 2023-10-23 2024-01-26 深圳市上融科技有限公司 Method for improving license OCR recognition rate

Similar Documents

Publication Publication Date Title
CN109144707A (en) A kind of unbalanced method of processing big data platform Spark data distribution
Schwarzkopf et al. Omega: flexible, scalable schedulers for large compute clusters
US7065766B2 (en) Apparatus and method for load balancing of fixed priority threads in a multiple run queue environment
US6748593B1 (en) Apparatus and method for starvation load balancing using a global run queue in a multiple run queue system
US6560628B1 (en) Apparatus, method, and recording medium for scheduling execution using time slot data
US6735769B1 (en) Apparatus and method for initial load balancing in a multiple run queue system
EP2212806B1 (en) Allocation of resources for concurrent query execution via adaptive segmentation
US20090025004A1 (en) Scheduling by Growing and Shrinking Resource Allocation
US20030225815A1 (en) Apparatus and method for periodic load balancing in a multiple run queue system
CN111651864B (en) Event centralized emission type multi-heterogeneous time queue optimization simulation execution method and system
TWI786564B (en) Task scheduling method and apparatus, storage media and computer equipment
CN106155794B (en) A kind of event dispatcher method and device applied in multi-threaded system
JP6428476B2 (en) Parallelizing compilation method and parallelizing compiler
D'Amico et al. Holistic slowdown driven scheduling and resource management for malleable jobs
CN116010064A (en) DAG job scheduling and cluster management method, system and device
US20030110204A1 (en) Apparatus and method for dispatching fixed priority threads using a global run queue in a multiple run queue system
CN106775975B (en) Process scheduling method and device
CN110928666A (en) Method and system for optimizing task parallelism based on memory in Spark environment
Gharajeh et al. Heuristic-based task-to-thread mapping in multi-core processors
Ramamritham et al. Scheduling strategies adopted in spring: An overview
CN113225269B (en) Container-based workflow scheduling method, device and system and storage medium
CN115202810A (en) Kubernetes working node distribution method and system
CN114995971A (en) Method and system for realizing pod batch scheduling in kubernets
Zhang et al. Cost-efficient and latency-aware workflow scheduling policy for container-based systems
CN112286631A (en) Kubernetes resource scheduling method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20191107

Address after: 610000 Room No. 7, Floor 12, Electronic and Information Industry Building No. 159, East First Ring Road, Chenghua District, Chengdu City, Sichuan Province

Applicant after: Chengdu Zhongke Cluster Information Technology Co., Ltd.

Address before: 610000 Chenghua District, Chengdu City, Sichuan Province, No. 4 University of Electronic Science and Technology

Applicant before: Tian Wenhong

Applicant before: Huang Chaojie

Applicant before: Liu Hongyi

Applicant before: Ren Xiaoqin

Applicant before: He Majun

Applicant before: Ye Yufei

SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190104