CN106502790A - A kind of task distribution optimization method based on data distribution - Google Patents

A task allocation optimization method based on data distribution

Info

Publication number
CN106502790A
CN106502790A (application CN201610890105.8A)
Authority
CN
China
Prior art keywords
task
distribution
node
global
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610890105.8A
Other languages
Chinese (zh)
Inventor
王洪添 (Wang Hongtian)
李萍 (Li Ping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Cloud Service Information Technology Co Ltd
Original Assignee
Shandong Inspur Cloud Service Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Inspur Cloud Service Information Technology Co Ltd filed Critical Shandong Inspur Cloud Service Information Technology Co Ltd
Priority to CN201610890105.8A priority Critical patent/CN106502790A/en
Publication of CN106502790A publication Critical patent/CN106502790A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a task allocation optimization method based on data distribution. The method is realized as follows: the data transmission cost of reduce tasks is evaluated according to the network distance between nodes and the weight distribution of the intermediate results; the optimal execution node set of each task is obtained from the data transmission cost of the reduce task on different nodes; and a specific task allocation strategy and algorithm are given based on the optimal execution node set. Compared with the prior art, this task allocation optimization method based on data distribution effectively reduces the data transmission caused by executing reduce tasks, cutting the network access requests of MapReduce programs by about 12% and shortening the job response time by about 9%, which makes it highly practical.

Description

Task allocation optimization method based on data distribution
Technical Field
The invention relates to the technical field of computer data integration, in particular to a highly practical task allocation optimization method based on data distribution.
Background
The explosive growth of information has pushed the internet into the big data era. Big data has become an important strategic resource and a new mode of decision-making, and cloud computing provides strong computing and storage capacity for big data processing and analysis. With the rise of big data and cloud computing, more and more companies are beginning to provide cloud services using MapReduce and Hadoop. MapReduce is a programming model proposed by Google, usually used for parallel operations on large-scale data sets; Hadoop is an open-source parallel programming framework that implements the MapReduce model together with a distributed file system (HDFS), and is characterized by high efficiency, high reliability, high fault tolerance, low cost, and scalability.
Network bandwidth has long been a bottleneck restricting the development of cloud computing, and is also one of the current research hotspots. As shown in fig. 1, a MapReduce program can be abstracted into two functions: a map function, which decomposes the input data and performs preliminary processing, and a reduce function, which summarizes the intermediate results to obtain the final result. The MapReduce framework generally schedules map tasks on the nodes storing the data blocks, which reduces data transmission and the occupation of network bandwidth. Reduce tasks, however, do not enjoy this data-locality advantage, because the input of a single reduce task usually comes from the output of multiple map tasks, and each reduce task must write its final result to HDFS; both the input and the output of the reduce function therefore occupy network bandwidth.
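As a minimal illustration of the two-phase model described above, the classic word-count job can be expressed as a map function and a reduce function. This is a sketch for exposition only (all names are illustrative), not the patented method:

```python
# Illustrative word-count job in the map/reduce style: the map function
# decomposes input lines, the reduce function summarizes intermediate results.
from collections import defaultdict

def map_fn(line):
    # Decompose the input and perform primary processing: emit (word, 1) pairs.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Summarize the intermediate results for one key.
    return key, sum(values)

def run_job(lines):
    # Shuffle: group intermediate key-value pairs by key, then reduce each group.
    groups = defaultdict(list)
    for line in lines:
        for k, v in map_fn(line):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(run_job(["big data", "big compute"]))  # {'big': 2, 'data': 1, 'compute': 1}
```

In a real cluster the shuffle step is exactly where the intermediate results cross the network, which is why the placement of reduce tasks matters.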
On this basis, the present invention provides a task allocation optimization method based on data distribution: by reasonably choosing the nodes on which the reduce tasks are started, the network and I/O overhead caused by data transmission is reduced and the performance of the MapReduce program is improved.
Disclosure of Invention
Aiming at the above defects, the technical task of the invention is to provide a highly practical task allocation optimization method based on data distribution.
A task allocation optimization method based on data distribution is specifically realized by the following steps:
firstly, evaluating the data transmission cost of the reduce task according to the network distance between the nodes and the weight distribution condition of the intermediate result;
secondly, obtaining an optimal execution node set of each task according to the data transmission cost of the reduce task on different nodes;
and thirdly, providing a specific task allocation strategy and algorithm based on the optimal execution node set.
The network distance between the nodes specifically refers to the following: when the MapReduce program has m map tasks M_i and n reduce tasks R_j, where 0 ≤ i ≤ m and 0 ≤ j ≤ n, the input of each reduce task comes from the output of all map tasks; the intermediate results generated by the map tasks are transmitted over the network to the nodes running the reduce tasks, and the sum of the distances from the nodes where all map tasks are located to the node where reduce task R_j is located is the total network distance TND_Rj of R_j.
For the intermediate-result weight distribution, a local prediction distribution map is restored from acquired global distribution information; the weight distribution of the intermediate results is counted and predicted at key-value-pair granularity, and the data transmission cost of the reduce task is evaluated in combination with the network distance.
The specific process of acquiring the global distribution information is as follows:
1) when the execution progress of the map stage is α, where slowstart_conf ≤ α ≤ 1, each node counts its intermediate-result key-value pairs; slowstart_conf is a user-configured parameter indicating that reduce tasks begin to execute once the proportion of completed map tasks reaches slowstart_conf;
2) when each node partitions its intermediate results according to the partition function, it counts the corresponding key-value pairs, generating a series of (k, n) tuples sorted in descending order of n;
3) a global truncation threshold θ is set: only the first θ% of the (k, n) tuple list in each local distribution map is used as the basis for constructing the global distribution map; the key-value-pair count n of the last tuple within the first θ% of a local distribution map is called the local truncation threshold of that node, and the truncated map is called the local truncation distribution map L;
4) the global distribution map G is constructed: first, the global distribution lower limit G_L and the global distribution upper limit G_U are defined; they represent, respectively, the minimum and maximum possible number of tuples for each key, as derived from the local truncation distribution maps and the local truncation thresholds; let G_L = {(k, N_L) | k ∈ K} and G_U = {(k, N_U) | k ∈ K}, where a node whose truncated map contains (k, n) contributes n to both N_L and N_U, while a node whose truncated map does not contain k contributes 0 to N_L and its local truncation threshold to N_U;
5) the global distribution map is G = {(k, N) | k ∈ K}, where N takes the intermediate value between the upper and lower limits, i.e. N = (N_L + N_U)/2;
6) the global distribution map is corrected by prediction according to the historical distribution: the distribution deviation of any key is taken as the difference between its current distribution proportion and its historical distribution proportion, and the key with the largest deviation is selected as the correction key k_c; using (k_c, n) and the historical distribution proportion of k_c, the total number of intermediate-result key-value pairs is predicted, from which the predicted key-value-pair count of each key is derived; the corrected global distribution map is called the global prediction distribution map G_c.
The specific process of restoring the local prediction distribution map through the global distribution information comprises the following steps:
the local prediction distribution map is L_c; from the global distribution map G, for any key k, if (k, n) ∈ L_i then its contribution to L_c is n, otherwise its contribution is the local truncation threshold of that node; the number of key-value pairs N_c still to be generated is predicted from the global prediction distribution map and the global distribution map, and the tuple counts are divided proportionally according to the running progress of each map task, i.e. if the progress of a map task is p, the predicted key-value-pair count of key k in that task's remaining intermediate results is allocated in proportion to its unfinished share (1 − p).
The evaluation of the data transmission cost of the reduce task in the first step is specifically: the data transmission cost Cost_{w,r} of node w executing reduce task r is the sum of the data transmission costs of pulling the corresponding intermediate-result key-value pairs from each node, i.e. Cost_{w,r} = Σ_{(k,n)∈r_input} n · d(w, m_i), where m_i is the node executing map task i, d(w, m_i) is the network distance between the two nodes, and r_input is the set of input key-value pairs of r.
The optimal execution node set obtained in step two is the optimal execution node set N_optimal(r) of reduce task r: executing task r on any node w in this set yields the minimum data transmission cost Cost_{w,r}. The specific process is as follows:
the optimal task set R_optimal(n) of an arbitrary node n is the set of tasks, among all unexecuted reduce tasks, for which node n pulls the intermediate-result key-value pairs with the minimum data transmission cost; when the current node is not the optimal execution node of any task, the task selector allocates it a task from R_optimal(n);
when a node requests a reduce task, the optimal execution node sets of the unexecuted tasks are first acquired in turn, and if the current node is the optimal execution node of a task, that task is returned; otherwise the skip count attribute of the task is incremented by 1, where the skip count records the number of times each task has been skipped because its optimal execution node has not requested it; if the current node is the optimal execution node of no task, the optimal execution task list of the current node is acquired and the task with the largest skip count is selected and allocated; the optimal execution nodes and optimal execution tasks are updated periodically before the map stage finishes, to ensure the timeliness of scheduling.
The task allocation optimization method based on data distribution has the following advantages that:
according to the task allocation optimization method based on data distribution, a local prediction distribution map is restored by acquiring more accurate global distribution information, and the weight distribution condition of an intermediate result is counted and predicted by taking a key value pair as a granularity; evaluating the data transmission cost of the reduce task according to the network distance between the nodes and the weight distribution condition of the intermediate result, and providing the accuracy of data perception and the network transmission cost balanced by a truncation prediction method; optimizing a distribution strategy of the reduce task in the cloud computing environment and giving a specific algorithm based on the optimal execution node set of the task and the optimal task set of the nodes; on the basis of a job-level scheduling strategy, network and I/O (input/output) expenses caused by data transmission are reduced by reasonably distributing the starting nodes of the reduce tasks, and meanwhile, the performance of a MapReduce program is improved.
Drawings
FIG. 1 is a MapReduce data flow diagram.
FIG. 2 is a schematic diagram of a Hadoop cluster network architecture.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
The Hadoop cluster adopts a master-slave architecture and a tree network topology. In a cloud computing environment, a data center usually includes a plurality of racks, each rack is equipped with a plurality of servers, and the architecture is characterized in that: the total bandwidth between nodes within the same rack is much higher than the bandwidth between nodes in different racks. The invention fully utilizes the characteristic and reduces the data transmission among the racks by reasonably distributing the starting nodes of the reduce task.
As shown in fig. 2, the task allocation optimization method based on data distribution of the present invention provides a task allocation optimization strategy with data-distribution awareness at its core and data transmission cost as the evaluation index. The strategy adopts the idea of a greedy algorithm: it computes the optimal execution node set by constructing a local prediction distribution map of the intermediate results, and reduces data transmission during reduce task execution as much as possible, thereby lowering the network and I/O overhead caused by data transmission while improving the time performance of the application and the throughput of the whole cluster.
The main contents comprise:
firstly, evaluating the data transmission cost of the reduce task according to the network distance between the nodes and the weight distribution condition of the intermediate result;
secondly, obtaining an optimal execution node set of each task according to the data transmission cost of the reduce task on different nodes;
and thirdly, providing a specific task allocation strategy and algorithm based on the optimal execution node set.
The network distance between the nodes specifically refers to the following: in a cloud computing environment, suppose a MapReduce program has m map tasks M_i (0 ≤ i ≤ m) and n reduce tasks R_j (0 ≤ j ≤ n), and the input of each reduce task comes from the output of all map tasks. Because the intermediate results generated by the map tasks must be transmitted over the network to the node running each reduce task, the sum of the distances from the nodes where all map tasks are located to the node where reduce task R_j is located is called the Total Network Distance TND_Rj of R_j. Clearly, the larger TND_Rj is, the more intermediate results must be transmitted to the reduce task and the slower the data transmission.
As shown in FIG. 2, assume a Hadoop cluster contains two racks, with N0-N9 representing its 10 slave nodes, where N0-N4 are located in rack 1 and N5-N9 in rack 2. Suppose the MapReduce program has 6 map tasks, located at nodes N0, N1, N2, N3, N5 and N6, and 4 reduce tasks, located at nodes N3, N4, N6 and N7. Assume further that the network distance from each child node to its parent in the Hadoop cluster is 1, so that the network distance between two nodes in the same rack is 2 and between two nodes in different racks is 4. The total network distance of each reduce task, TND_R0, TND_R1, TND_R2 and TND_R3, is then calculated as follows:
TND_R0 = 3×2 + 2×4 = 14;
TND_R1 = 4×2 + 2×4 = 16;
TND_R2 = 4×4 + 1×2 = 18;
TND_R3 = 4×4 + 2×2 = 20.
It can be seen that when a reduce task is located on different slave nodes, its total network distance differs. This indicates that reduce tasks also have data-locality properties; however, unlike a map task, a reduce task is more concerned with the map output on an entire rack than with the input data on a single node. The total network distance is 14 when the reduce task is located at node N3 in rack 1 and 20 when it is located at node N7 in rack 2. Reasonably choosing the start node of a reduce task can therefore reduce the overall network distance, shorten the shuffle stage, and improve the time performance of the application.
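The total network distances above can be reproduced with a short sketch, using the distances assumed in the text (0 from a node to itself, 2 within a rack, 4 across racks); the node and rack names are those of the FIG. 2 example:

```python
# Total network distance (TND) for the example cluster of FIG. 2.
def distance(a, b, rack):
    # Network distance between two nodes given a node-to-rack assignment.
    if a == b:
        return 0
    return 2 if rack[a] == rack[b] else 4

rack = {n: (1 if n in {"N0", "N1", "N2", "N3", "N4"} else 2)
        for n in ["N0", "N1", "N2", "N3", "N4", "N5", "N6", "N7", "N8", "N9"]}
map_nodes = ["N0", "N1", "N2", "N3", "N5", "N6"]  # nodes running map tasks

def tnd(reduce_node):
    # Sum of distances from every map node to the reduce node.
    return sum(distance(m, reduce_node, rack) for m in map_nodes)

for r in ["N3", "N4", "N6", "N7"]:
    print(r, tnd(r))  # N3 14, N4 16, N6 18, N7 20
```

This reproduces TND_R0 through TND_R3 and shows why the N3 placement (same rack as most map output) is cheapest.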
Besides the network distance of the reduce task, the weight distribution of the intermediate results is another important factor in measuring data transmission cost. The distribution of intermediate-result key-value pairs can be collected and counted at partition granularity, but two problems remain: 1) to reduce the delay caused by transmitting intermediate-result data, scheduling of reduce tasks generally starts before the map stage has completely finished, and the partition sizes at that moment may differ greatly from the final key-value-pair distribution, making the scheduling result inaccurate; 2) the distribution of key-value pairs usually shows some regularity, but because the partition granularity is coarse and the partition distribution depends entirely on the partition function, the final key-value-pair distribution cannot be predicted or corrected from existing knowledge even if the final partition distribution is predicted in this way. To address these problems, the invention counts and predicts the weight distribution of the intermediate results at key-value-pair granularity and evaluates the data transmission cost of the reduce task in combination with the network distance.
The number of intermediate-result key-value pairs must be counted on every node executing a map task, but because the data volume is large, having each node transmit all of its key-value-pair tuples (k, n) to the data distribution collector on the master node would consume considerable network resources and time. On the other hand, absolutely accurate key-value-pair distribution information is of limited value, so that approach is not reasonable. The invention first obtains reasonably accurate global distribution information and then restores a local prediction distribution map from the global prediction distribution map to calculate the data transmission cost of executing a task. The specific statistical process is as follows:
(1) When the execution progress of the map stage is α, where slowstart_conf ≤ α ≤ 1, each node counts its intermediate-result key-value pairs; slowstart_conf is a user-configured parameter indicating that reduce tasks begin to execute once the proportion of completed map tasks reaches slowstart_conf.
(2) When each node partitions its intermediate results according to the partition function, it counts the corresponding key-value pairs, generating a series of (k, n) tuples sorted in descending order of n.
(3) A global truncation threshold θ is set: only the first θ% of the (k, n) tuple list in each local distribution map is used as the basis for constructing the global distribution map. The key-value-pair count n of the last tuple within the first θ% of a local distribution map is called the local truncation threshold of that node, and the truncated map is called the local truncation distribution map L.
(4) The global distribution map G is constructed. First, the global distribution lower limit G_L and the global distribution upper limit G_U are defined; they represent, respectively, the minimum and maximum possible number of tuples for each key, as derived from the local truncation distribution maps and the local truncation thresholds. Let G_L = {(k, N_L) | k ∈ K} and G_U = {(k, N_U) | k ∈ K}: a node whose truncated map contains (k, n) contributes n to both N_L and N_U, while a node whose truncated map does not contain k contributes 0 to N_L and its local truncation threshold to N_U.
(5) The global distribution map is G = {(k, N) | k ∈ K}, where N takes the intermediate value between the upper and lower limits, i.e. N = (N_L + N_U)/2.
(6) The global distribution map is corrected by prediction according to the historical distribution: the distribution deviation of any key is taken as the difference between its current distribution proportion and its historical distribution proportion, and the key with the largest deviation is selected as the correction key k_c. Using (k_c, n) and the historical distribution proportion of k_c, the total number of intermediate-result key-value pairs is predicted, from which the predicted key-value-pair count of each key is derived. The corrected global distribution map is called the global prediction distribution map G_c.
(7) The local prediction distribution map L_c is restored from the global prediction distribution map. From the global distribution map G, for any key k, if (k, n) ∈ L_i then its contribution to L_c is n; otherwise its contribution is the local truncation threshold of that node. The number of key-value pairs N_c still to be generated can be predicted from the global prediction distribution map and the global distribution map, and the tuple counts are divided proportionally according to the running progress of each map task: if the progress of a map task is p, the predicted key-value-pair count of key k in that task's remaining intermediate results is allocated in proportion to its unfinished share (1 − p).
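The truncation and bounding steps above can be sketched as follows. Note the assumptions, since the original formulas were lost in extraction: a key absent from a node's truncated list is taken to contribute 0 to the lower bound and that node's local truncation threshold to the upper bound, and the global estimate is taken as the midpoint of the two bounds. All function and variable names are illustrative:

```python
# Sketch of steps (2)-(5): local truncation, then global distribution bounds.
from collections import Counter

def local_truncation(counts, theta):
    # Keep the top theta% of (k, n) tuples, sorted by n descending.
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    keep = max(1, int(len(ranked) * theta / 100))
    kept = ranked[:keep]
    threshold = kept[-1][1]  # local truncation threshold: smallest kept count
    return dict(kept), threshold

def global_map(local_maps, theta):
    truncated = [local_truncation(m, theta) for m in local_maps]
    keys = set().union(*(t[0] for t in truncated))
    g = {}
    for k in keys:
        lo = sum(t.get(k, 0) for t, _ in truncated)       # G_L: absent key -> 0
        hi = sum(t.get(k, thr) for t, thr in truncated)   # G_U: absent -> threshold
        g[k] = (lo + hi) / 2                              # midpoint estimate (assumed)
    return g

node1 = Counter({"a": 50, "b": 30, "c": 2})
node2 = Counter({"a": 40, "c": 35, "d": 1})
print(global_map([node1, node2], theta=70))
```

With θ = 70, each node keeps its top two tuples, so key "b" (unreported by node2) gets bounds [30, 65] and estimate 47.5, illustrating how truncation trades accuracy for collection cost.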
In summary, based on the network distance between nodes and the intermediate-result weight distribution, the following can be obtained: the data transmission cost Cost_{w,r} of node w executing reduce task r is the sum of the data transmission costs of pulling the corresponding intermediate-result key-value pairs from each node, i.e. Cost_{w,r} = Σ_{(k,n)∈r_input} n · d(w, m_i), where m_i is the node executing map task i, d(w, m_i) is the network distance between the two nodes, and r_input is the set of input key-value pairs of r.
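The cost formula can be sketched directly: for each map node, multiply the predicted number of key-value pairs it holds for reduce task r by its network distance to the candidate node w, and sum. The node names, distances, and counts below are toy values, not from the patent:

```python
# Cost_{w,r} = sum over r's input of n * d(w, m_i): the cost for node w to
# pull each predicted intermediate-result count n from the map node m_i.
def transmission_cost(w, r_input, d):
    # r_input: list of (map_node, predicted_pair_count) for reduce task r.
    return sum(n * d(w, m) for m, n in r_input)

# Toy distance function: 0 to self, 2 within a rack, 4 across racks.
rack = {"N0": 1, "N1": 1, "N5": 2}
d = lambda a, b: 0 if a == b else (2 if rack[a] == rack[b] else 4)

r_input = [("N0", 100), ("N1", 50), ("N5", 10)]
print(transmission_cost("N0", r_input, d))  # 0*100 + 2*50 + 4*10 = 140
print(transmission_cost("N5", r_input, d))  # 4*100 + 4*50 + 0*10 = 600
```

Evaluating this cost for every candidate node and taking the minimizers yields the optimal execution node set N_optimal(r) used in step two.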
The optimal execution node set obtained in step two is the optimal execution node set N_optimal(r) of reduce task r: executing task r on any node w in this set yields the minimum data transmission cost Cost_{w,r}. To reduce the network and I/O overhead caused by transmitting intermediate-result data, the optimal allocation scheme for reduce tasks would assign every task to its own optimal execution node, achieving the lowest overall data transmission cost. Sometimes, however, in order to meet the job response time requirements of the user's service level agreement in real time, the service provider must complete the assignment of all tasks on time. Under this constraint, some reduce tasks may not be able to execute on their optimal execution nodes.
Furthermore, whether resources are available on the optimal execution node is also one of the factors constraining task allocation. To address this, the optimal task set R_optimal(n) of an arbitrary node n is defined as the set of tasks, among all unexecuted reduce tasks, for which node n pulls the intermediate-result key-value pairs with the minimum data transmission cost. When the current node is not the optimal execution node of any task, the task selector assigns it a task from R_optimal(n), allocated specifically according to the following algorithm:
when a node requests a reduce task, firstly, an optimal execution node set of the unexecuted task is sequentially acquired, and if the current node is the optimal execution node of the task, the task is returned; otherwise, add 1 to the skipcount attribute for the task, which records the number of times each task is skipped because it does not get the best performing node request (lines 1-8). If the current node is not the optimal execution node of any task, the optimal execution task list of the current node is obtained, and the task with the largest distribution skip count is selected (lines 9-16). And the optimal execution node and the optimal execution task are periodically updated before the execution of the map stage is finished so as to ensure the real-time performance of the scheduling.
In a cloud computing environment, the task allocation optimization method can effectively reduce the data transmission caused by executing reduce tasks, cutting the network access requests of MapReduce programs by about 12% and shortening the job response time by about 9%.
The above embodiments are only specific cases of the present invention, and the protection scope of the present invention includes but is not limited to the above embodiments, and any suitable changes or substitutions that are consistent with the claims of a data distribution-based task assignment optimization method of the present invention and are made by those skilled in the art should fall within the protection scope of the present invention.

Claims (7)

1. A task allocation optimization method based on data distribution is characterized in that the implementation process is as follows:
firstly, evaluating the data transmission cost of the reduce task according to the network distance between the nodes and the weight distribution condition of the intermediate result;
secondly, obtaining an optimal execution node set of each task according to the data transmission cost of the reduce task on different nodes;
and thirdly, providing a specific task allocation strategy and algorithm based on the optimal execution node set.
2. The method for optimizing task allocation based on data distribution according to claim 1, wherein the network distance between the nodes specifically refers to: when the MapReduce program has m map tasks M_i and n reduce tasks R_j, where 0 ≤ i ≤ m and 0 ≤ j ≤ n, the input of each reduce task comes from the output of all map tasks; the intermediate results generated by the map tasks are transmitted over the network to the nodes running the reduce tasks, and the sum of the distances from the nodes where all map tasks are located to the node where reduce task R_j is located is the total network distance TND_Rj of R_j.
3. The method as claimed in claim 1, wherein for the intermediate-result weight distribution, a local prediction distribution map is restored from acquired global distribution information, the weight distribution of the intermediate results is counted and predicted at key-value-pair granularity, and the data transmission cost of the reduce task is evaluated in combination with the network distance.
4. The method for optimizing task allocation based on data distribution according to claim 3, wherein the specific process of obtaining the global distribution information is as follows:
1) when the execution progress of the map stage is α, where slowstart_conf ≤ α ≤ 1, each node counts its intermediate-result key-value pairs; slowstart_conf is a user-configured parameter indicating that reduce tasks begin to execute once the proportion of completed map tasks reaches slowstart_conf;
2) when each node partitions its intermediate results according to the partition function, it counts the corresponding key-value pairs, generating a series of (k, n) tuples sorted in descending order of n;
3) a global truncation threshold θ is set: only the first θ% of the (k, n) tuple list in each local distribution map is used as the basis for constructing the global distribution map; the key-value-pair count n of the last tuple within the first θ% of a local distribution map is called the local truncation threshold of that node, and the truncated map is called the local truncation distribution map L;
4) the global distribution map G is constructed: first, the global distribution lower limit G_L and the global distribution upper limit G_U are defined; they represent, respectively, the minimum and maximum possible number of tuples for each key, as derived from the local truncation distribution maps and the local truncation thresholds; let G_L = {(k, N_L) | k ∈ K} and G_U = {(k, N_U) | k ∈ K}, where a node whose truncated map contains (k, n) contributes n to both N_L and N_U, while a node whose truncated map does not contain k contributes 0 to N_L and its local truncation threshold to N_U;
5) the global distribution map is G = {(k, N) | k ∈ K}, where N takes the intermediate value between the upper and lower limits, i.e. N = (N_L + N_U)/2;
6) the global distribution map is corrected by prediction according to the historical distribution: the distribution deviation of any key is taken as the difference between its current distribution proportion and its historical distribution proportion, and the key with the largest deviation is selected as the correction key k_c; using (k_c, n) and the historical distribution proportion of k_c, the total number of intermediate-result key-value pairs is predicted, from which the predicted key-value-pair count of each key is derived; the corrected global distribution map is called the global prediction distribution map G_c.
5. The method for optimizing task allocation based on data distribution according to claim 4, wherein the specific process of restoring the local prediction distribution map through the global distribution information is as follows:
the local prediction distribution map is L_c; from the global distribution map G, for any key k, if (k, n) ∈ L_i then its contribution to L_c is n, otherwise its contribution is the local truncation threshold of that node; the number of key-value pairs N_c still to be generated is predicted from the global prediction distribution map and the global distribution map, and the tuple counts are divided proportionally according to the running progress of each map task, i.e. if the progress of a map task is p, the predicted key-value-pair count of key k in that task's remaining intermediate results is allocated in proportion to its unfinished share (1 − p).
6. The method for optimizing task allocation based on data distribution according to claim 5, wherein evaluating the data transmission cost of the reduce task in the first step is specifically: the data transmission cost Cost_{w,r} of node w executing reduce task r is the sum of the data transmission costs of pulling the corresponding intermediate-result key-value pairs from each node, i.e. Cost_{w,r} = Σ_{(k,n)∈r_input} n · d(w, m_i), where m_i is the node executing map task i, d(w, m_i) is the network distance between the two nodes, and r_input is the set of input key-value pairs of r.
7. The method as claimed in claim 1, wherein obtaining the optimal execution node set of each task in the second step means obtaining, for each reduce task r, the optimal execution node set N_optimal(r), such that executing task r at any node w in the set yields the minimal data transmission cost Cost_{w/r}; the specific process comprises the following steps:
the optimal task set R_optimal(n) of any node n is the set, among all reduce tasks not yet executed, of those tasks for which node n pulls the intermediate-result key-value pairs with the minimum data transmission cost; when the current node is not the optimal execution node of any task, the task selector allocates tasks to it from R_optimal(n);
when a node requests a reduce task, the optimal execution node sets of the unexecuted tasks are first acquired in sequence; if the current node is the optimal execution node of a task, that task is returned; otherwise the task's skip-count attribute is incremented by 1, where the skip count records the number of times each task has been skipped because no request from its optimal execution node was received; if the current node is not the optimal execution node of any task, the optimal execution task list of the current node is acquired and the task with the largest skip count is selected and allocated; the optimal execution nodes and optimal execution tasks are updated periodically until the map stage finishes, so as to ensure the timeliness of the scheduling.
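The request-handling policy of claim 7 can be sketched as below; the task bookkeeping, the optimal-node table, and the per-node R_optimal(n) lists are hypothetical, and the periodic refresh is omitted:

```python
# Illustrative sketch of claim 7's skip-count allocation policy. Task
# bookkeeping and the optimal-node / R_optimal(n) tables are hypothetical.

def assign_reduce_task(node, pending, optimal_node, optimal_tasks, skip_count):
    """Serve a reduce-task request from `node`.
    optimal_node[r]  -- node with minimal Cost_{node/r} for task r;
    optimal_tasks[n] -- R_optimal(n), tasks cheapest to pull at node n;
    skip_count[r]    -- times r was passed over by non-optimal requesters."""
    for r in pending:
        if optimal_node[r] == node:      # node is r's optimal executor
            pending.remove(r)
            return r
        skip_count[r] += 1               # r is skipped once more
    # node is optimal for no pending task: fall back to R_optimal(node)
    # and serve the most-skipped task first
    mine = [r for r in optimal_tasks.get(node, []) if r in pending]
    if not mine:
        return None
    r = max(mine, key=lambda t: skip_count[t])
    pending.remove(r)
    return r

pending = ["r1", "r2"]
optimal_node = {"r1": "n2", "r2": "n2"}   # n1 is optimal for no task
optimal_tasks = {"n1": ["r1", "r2"]}
skips = {"r1": 3, "r2": 1}
task = assign_reduce_task("n1", pending, optimal_node, optimal_tasks, skips)
```

In this run n1 is the optimal executor of neither task, so both skip counts are incremented and the most-skipped task ("r1") is handed out from n1's optimal task list.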
CN201610890105.8A 2016-10-12 2016-10-12 A kind of task distribution optimization method based on data distribution Pending CN106502790A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610890105.8A CN106502790A (en) 2016-10-12 2016-10-12 A kind of task distribution optimization method based on data distribution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610890105.8A CN106502790A (en) 2016-10-12 2016-10-12 A kind of task distribution optimization method based on data distribution

Publications (1)

Publication Number Publication Date
CN106502790A true CN106502790A (en) 2017-03-15

Family

ID=58295238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610890105.8A Pending CN106502790A (en) 2016-10-12 2016-10-12 A kind of task distribution optimization method based on data distribution

Country Status (1)

Country Link
CN (1) CN106502790A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120151292A1 (en) * 2010-12-14 2012-06-14 Microsoft Corporation Supporting Distributed Key-Based Processes
CN102541858A (en) * 2010-12-07 2012-07-04 腾讯科技(深圳)有限公司 Data equality processing method, device and system based on mapping and protocol
CN102629219A (en) * 2012-02-27 2012-08-08 北京大学 Self-adaptive load balancing method for Reduce ends in parallel computing framework
CN103279351A (en) * 2013-05-31 2013-09-04 北京高森明晨信息科技有限公司 Method and device for task scheduling
US20130290972A1 (en) * 2012-04-27 2013-10-31 Ludmila Cherkasova Workload manager for mapreduce environments
US20160034482A1 (en) * 2014-07-31 2016-02-04 International Business Machines Corporation Method and apparatus for configuring relevant parameters of mapreduce applications
CN105589752A (en) * 2016-02-24 2016-05-18 哈尔滨工业大学深圳研究生院 Cross-data center big data processing based on key value distribution


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Jie: "Research on SLA-based MapReduce Scheduling Mechanism", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109496321A (en) * 2017-07-10 2019-03-19 欧洲阿菲尼帝科技有限责任公司 For estimating the technology of the expection performance in task distribution system
CN107506388A (en) * 2017-07-27 2017-12-22 浙江工业大学 A kind of iterative data balancing optimization method towards Spark parallel computation frames
CN109871265A (en) * 2017-12-05 2019-06-11 航天信息股份有限公司 The dispatching method and device of Reduce task
CN109947559A (en) * 2019-02-03 2019-06-28 百度在线网络技术(北京)有限公司 Optimize method, apparatus, equipment and computer storage medium that MapReduce is calculated
CN109947559B (en) * 2019-02-03 2021-11-23 百度在线网络技术(北京)有限公司 Method, device, equipment and computer storage medium for optimizing MapReduce calculation
CN113467700A (en) * 2020-03-31 2021-10-01 阿里巴巴集团控股有限公司 Data distribution method and device based on heterogeneous storage
CN113467700B (en) * 2020-03-31 2024-04-23 阿里巴巴集团控股有限公司 Heterogeneous storage-based data distribution method and device

Similar Documents

Publication Publication Date Title
CN112153700B (en) Network slice resource management method and equipment
US9201690B2 (en) Resource aware scheduling in a distributed computing environment
CN106502790A (en) A kind of task distribution optimization method based on data distribution
US20130290972A1 (en) Workload manager for mapreduce environments
CN108268318A (en) A kind of method and apparatus of distributed system task distribution
CN112306651B (en) Resource allocation method and resource borrowing method
CN103699433B (en) One kind dynamically adjusts number of tasks purpose method and system in Hadoop platform
CN110221920B (en) Deployment method, device, storage medium and system
Liu et al. Preemptive hadoop jobs scheduling under a deadline
CN109189572B (en) Resource estimation method and system, electronic equipment and storage medium
CN112463390A (en) Distributed task scheduling method and device, terminal equipment and storage medium
Li et al. An effective scheduling strategy based on hypergraph partition in geographically distributed datacenters
Chen et al. Latency minimization for mobile edge computing networks
US20210390405A1 (en) Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof
CN105227601A (en) Data processing method in stream processing system, device and system
US20220300323A1 (en) Job Scheduling Method and Job Scheduling Apparatus
CN117806659A (en) ES high-availability cluster containerized deployment method and related device
CN110048966B (en) Coflow scheduling method for minimizing system overhead based on deadline
CN109150759B (en) Progressive non-blocking opportunity resource reservation method and system
Guo Ant colony optimization computing resource allocation algorithm based on cloud computing environment
CN116915869A (en) Cloud edge cooperation-based time delay sensitive intelligent service quick response method
Yassir et al. Graph-based model and algorithm for minimising big data movement in a cloud environment
CN106844037B (en) KNL-based test method and system
CN115562841A (en) Cloud video service self-adaptive resource scheduling system and method
CN105187483A (en) Method and device for allocating cloud computing resources

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170315