CN107506388A - Iterative data balancing optimization method for the Spark parallel computing framework - Google Patents
Iterative data balancing optimization method for the Spark parallel computing framework
- Publication number: CN107506388A
- Application number: CN201710623289.6A
- Authority
- CN
- China
- Prior art keywords
- partition
- data
- bucket
- logical micro-partition
- block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/0223—User address space allocation, e.g. contiguous or non contiguous base addressing
- G06F12/023—Free address space management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
Abstract
An iterative data balancing partition method for the Spark parallel computing framework. First, each coarse-grained big-data Block is subdivided into fine-grained FG-Blocks, and logical micro-partitions and their indexes are created from the FG-Blocks. Second, a number of Buckets equal to the number of Reducers is created. Third, the timing, quantity, and criterion of iterative data partitioning are determined. Fourth, the local and global data distribution of each Bucket is recorded. Fifth, the selected logical micro-partitions are assigned to the Buckets according to the data balancing partition algorithm and the recorded distribution. Finally, the data allocated to each Bucket is transferred to the Reducer side. The present invention proposes a new data balancing partition method for the Spark framework that reduces data skew during big-data processing and improves the overall big-data processing performance of the Spark parallel computing framework.
Description
Technical field
The present invention relates to the fields of big data processing and high-performance computing, and in particular proposes an iterative data balancing optimization method for the Spark parallel computing framework.
Background technology
MapReduce is a parallel computing model for big data processing proposed by Google in 2004. By running multiple tasks concurrently on a large number of inexpensive cluster nodes, it improves data processing performance, and over the past decade it has developed rapidly and been widely adopted. Spark is a parallel computing framework based on MapReduce, developed in 2009 at the AMPLab of the University of California, Berkeley. It retains the advantages of MapReduce while keeping intermediate task results in memory, which reduces disk I/O overhead and improves big-data processing performance; it has become a mainstream framework for building big-data processing platforms.
Data imbalance during big-data processing, also known as data skew, is a major bottleneck that degrades the overall performance of the Spark framework. Experimental results by Lin J (Proceedings of the 7th Workshop on Large-Scale Distributed Systems for Information Retrieval, 2009) show that with the currently default hash partition method, up to 92% of Reducers exhibit data imbalance, reducing overall performance by 22% to 38%.
To address data skew in big-data processing, researchers at home and abroad have proposed a variety of methods, which can be roughly grouped into the following classes. (1) Two-phase data partitioning: proposed by Gufler et al. (Proceedings of the 1st International Conference on Cloud Computing and Services Science, 2011). Partitions are first generated with a one-shot partition method; the partitioning is then analyzed at run time, and if skew occurs, oversized partitions are split and the split data is moved to smaller partitions, achieving balanced partitioning. The effectiveness of this method, however, depends on the timing of the adjustment: splitting large partitions too early increases the chance of mistaken splits, while splitting them too late delays data transfer. (2) Multi-stage data partitioning: proposed by Wang Zhuo et al. (Chinese Journal of Computers, 2016). Fine-grained partitions are generated in the Map stage; a defined cost model then evaluates partition balance at run time, and when a certain condition is met the selected fine-grained partitions are assigned to Reducers. Through multiple rounds of screening and assignment, the data distribution reaches a certain balance. However, the partitioning timing is hard to pin down, so the method lacks generality. (3) Sampling-based data partitioning: proposed by Ramakrishnan et al. (Proceedings of the 3rd ACM Symposium on Cloud Computing, 2012). Sampling is combined with data splitting: an extra process is added to the data processing pipeline to analyze the data distribution, and once a certain proportion of the data has been processed, partitions are split and merged according to the sampling analysis, i.e., large partitions are split and merged with smaller ones. However, this method requires extra overhead to collect the data distribution and increases data access and transfer costs; moreover, sampling carries uncertainty: too little sampling yields insufficient accuracy, while too much adds further overhead. (4) Delayed data partitioning: proposed by Kwon et al. (Proceedings of the 1st ACM Symposium on Cloud Computing, 2010). A cost model estimates partition sizes, and data partitioning starts only at a certain point during task execution. Because data transfer must wait until partitioning is done, transfer is delayed, large amounts of data wait at the Mapper side, and data processing cannot overlap with data transfer. (5) Migration-based data partitioning: also proposed by Kwon et al. (Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 2012). Rather than aiming at balanced data partitioning, it balances node load through data migration: a cost model estimates the remaining cost of unfinished Reducer tasks, and when certain conditions are met, unprocessed data on a node is migrated to nodes whose tasks have completed, balancing task execution across nodes. This, however, adds extra data transfer cost and to some extent delays job completion.
The content of the invention
To overcome the data skew problem in big-data processing, the present invention proposes an iterative data balancing partition method for the Spark parallel computing framework: big-data partitions are assigned to the Reducers iteratively so that the amount of data handled by each Reducer reaches an overall balance, improving the overall big-data processing performance of the Spark parallel computing framework.
An iterative data balancing partition method for the Spark parallel computing framework comprises the following steps:
(1) Create logical micro-partitions and logical micro-partition indexes
(1.1) Create fine-grained data blocks
In the Spark framework, the default data processing unit is the coarse-grained data block (Block), whose size is typically set to 128 MB. Each Block is further subdivided into multiple fine-grained data blocks (Fine-grained Blocks, FG-Blocks), which are then processed iteratively;
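The subdivision step can be sketched as follows. This is a minimal illustration, not Spark code; `fg_size` is a hypothetical tuning parameter, since the method fixes only the 128 MB default Block size, not the FG-Block granularity:

```python
def split_block(block: bytes, fg_size: int) -> list:
    """Subdivide a coarse-grained Block into fine-grained FG-Blocks.

    `fg_size` is an assumed tuning parameter: the text fixes only the
    default Block size (128 MB), not the FG-Block granularity.
    """
    return [block[i:i + fg_size] for i in range(0, len(block), fg_size)]

# A 6-byte "Block" split into 4-byte FG-Blocks yields two sub-blocks.
assert split_block(b"abcdef", 4) == [b"abcd", b"ef"]
```

Each returned FG-Block then becomes one unit of the iterative processing described below.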
(1.2) Create logical micro-partitions
In the Mapper stage, the tuples obtained from processing an FG-Block are stored in a cache (Buffer), and tuples with identical Key values are merged. Each merged tuple set is called a logical micro-partition, the basic unit of iterative partitioning;
(1.3) Create logical micro-partition indexes
An index is created from the Key value of each logical micro-partition; the index is the structure used to record which Reducer a logical micro-partition will be assigned to;
(1.4) Design the Bucket structure
A number of Buckets equal to the number of Reducers is created, giving a one-to-one correspondence between Reducers and Buckets. Each Bucket is logically divided into multiple Segments, and each Segment can store only one logical micro-partition. The tuple set obtained in each iteration is called a logical micro-partition iteration block; different iteration blocks are stored in the Slots of a Segment, and multiple logical micro-partitions may be assigned to the same Reducer;
(1.5) Create the logical micro-partition vector
The number of tuples in a logical micro-partition is called its tuple factor. To associate each index with its tuple factor and allocation status, a logical micro-partition vector α recording the indexes, tuple factors, and allocation status is created:
α = (t1, t2, t3, …, tn)
where ti = (index, factor, bno, sno) (i ∈ [1, n]); index is the index value, factor is the tuple factor, bno is the Bucket number, and sno is the Segment number;
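As a sketch, the vector α and its components ti can be modeled as follows. This is a hypothetical Python rendering of the structure defined above, assuming (as in the embodiment) that -1 marks an unassigned Bucket or Segment:

```python
from dataclasses import dataclass

@dataclass
class MicroPartitionEntry:
    """One component t_i = (index, factor, bno, sno) of the vector α."""
    index: int     # index value derived from the Key
    factor: int    # tuple factor: number of tuples sharing this Key
    bno: int = -1  # Bucket number (-1 = not yet assigned)
    sno: int = -1  # Segment number within the Bucket (-1 = no Segment created)

# α for a hypothetical FG-Block whose tuples merge to (spark,2)(is,2)(a,1):
alpha = [
    MicroPartitionEntry(index=0, factor=2),
    MicroPartitionEntry(index=1, factor=2),
    MicroPartitionEntry(index=2, factor=1),
]
```

Before any allocation, every entry carries bno = sno = -1; the allocation algorithm of step (5) fills these fields in.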
(2) Determine the timing and quantity of iterative data partitioning
When the Mapper finishes processing an FG-Block and the logical micro-partition indexes have been created, data allocation for the current iteration begins; the number of partitions is set to the number of Buckets;
When the tuple factor of some logical micro-partition is much larger than those of the other micro-partitions, it is not split; instead, an alternative attribute is combined with the Key to perform the partitioning;
(3) Determine the criterion of iterative data partitioning
When there are many logical micro-partitions, those with large tuple factors are selected first, so that micro-partitions with many tuples are transferred to their designated Reducers as early as possible while those with few tuples are transmitted later. This effectively overlaps computation with data transfer and helps reduce the storage space occupied by the Buckets;
Accordingly, the still-unallocated logical micro-partitions are sorted by tuple factor, and the first several micro-partitions are selected and assigned to Reducers;
(4) Record the local and global data distribution
(4.1) Record the local data distribution
The logical micro-partition vector α records the tuple factors of all logical micro-partitions in the current iteration round of one Mapper and thus describes that Mapper's data partitioning in the current iteration;
(4.2) Record the global data distribution
The corresponding components of the α vectors of all Mappers are accumulated to obtain the global data distribution, represented by the two-dimensional vector AM:
AM = Σ αs (s ∈ [1, m])
At the end of each Mapper iteration, the vector α is sent to the Master and added into AM, yielding the global data distribution up to the current iteration round;
Because Reducers and Buckets correspond one to one, the number of tuples allocated to each Reducer can be measured by counting the tuples in each Bucket. A data allocation vector AB is defined:
AB = (B1, B2, B3, …, Bb)
where Br = (f1, f2, f3, …, fn) (r ∈ [1, b]). This vector records the allocation of the logical micro-partitions to the Buckets: each row represents one Bucket, and the elements of a row are the index values of the micro-partitions assigned to that Bucket;
The total number of tuples allocated to each Bucket is obtained by summing, for each row of AB, the tuple factors corresponding to its index values;
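The per-Bucket tuple totals derived from AB can be computed as in this sketch (names are illustrative; `factors` plays the role of the tuple factors recorded in the accumulated vector AM):

```python
def bucket_totals(AB, factors):
    """Per-Bucket tuple totals: for each row of AB, sum the tuple factors
    corresponding to the micro-partition index values it contains."""
    return [sum(factors[idx] for idx in row) for row in AB]

# Hypothetical state: Bucket 0 holds indexes {0, 3}, Bucket 1 holds {1}.
factors = {0: 5, 1: 4, 2: 2, 3: 2}
assert bucket_totals([[0, 3], [1]], factors) == [7, 4]
```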
(5) Design the data balancing partition algorithm
The task of the data balancing partition algorithm is, with the future data distribution unknown, to assign the p logical micro-partitions selected in the current iteration to the b Buckets according to the micro-partitions and tuple counts already assigned to each Bucket, keeping the data volumes of the Buckets as balanced as possible after allocation;
The data allocation algorithm is as follows:
Step a1: accumulate the logical micro-partition vectors α of the FG-Blocks to obtain AM;
Step a2: sort the unallocated logical micro-partitions in AM in descending order;
Step a3: sort the Buckets in ascending order of allocated data;
Step a4: take the first p unallocated logical micro-partitions from AM and assign them to the first b Buckets;
Step a5: if there is a next FG-Block, terminate; otherwise, go to step a6;
Step a6: sort the Buckets in ascending order of allocated data;
Step a7: take the first p unallocated logical micro-partitions from AM and assign them to the first b Buckets;
Step a8: if unallocated logical micro-partitions remain in AM, go to step a6; otherwise, terminate;
Processing the fine-grained FG-Blocks obtained by subdivision iteratively in this way optimizes the data balance across the Reducers, reduces data skew, and improves Spark's data processing performance.
Advantages of the invention:
For the data skew problem in big-data processing, the present invention optimizes the allocation process of the Spark parallel computing framework and achieves balanced allocation of big data: coarse-grained data blocks are subdivided and the resulting fine-grained data blocks are processed iteratively; logical micro-partitions and their indexes are established from the Key values; the timing, condition, criterion, and quantity of micro-partition allocation are determined; and the logical micro-partitions of each iteration round are assigned incrementally to the Reducer side. This balances the big-data partitions overall and improves the overall big-data processing performance of Spark.
Brief description of the drawings
Fig. 1 shows the overall structure of the iterative data balancing partitioning of the present invention
Fig. 2 shows the Bucket and Segment storage structures of the present invention
Fig. 3 shows the logical micro-partition data allocation process over two iterations of the present invention
Embodiment
Taking the classic big-data word count program WordCount as an example, and with reference to Figs. 1-3, an embodiment of the present invention is described further:
Assume the WordCount program is to count the words in the contents of 4 Blocks, assigned to 4 nodes. Each Block contains 2 rows of data, with the following contents:
Block1:
Spark is a fast and Spark is a general-purpose engine for large-scale
data processing.
Spark runs programs faster than Hadoop MapReduce in memory and on
disk.
Block2:
Spark performance is impacted by many soft system,hardware and
dataset factors.
Spark can run both by itself,or over several existing cluster
managers.
Block3:
Big Data can be defined as large data sets are being generated from
different sources.
The use of the MapReduce and Spark are two approaches perform data
analytics.
Block4:
Apache Spark is like the MapReduce model such that it is an open
source framework.
Spark was developed within UC Berkeley's AMPLab and later released as
open source.
(1) Create logical micro-partitions and logical micro-partition indexes
(1.1) Create fine-grained data blocks
Each Block is divided at fine granularity; assuming each row is one FG-Block, subdivision yields 2 fine-grained data blocks per Block;
(1.2) Create logical micro-partitions
In the Mapper stage, the tuples obtained from processing each row of words are stored in the cache (Buffer), and tuples with identical Key values are merged. Each merged tuple set is a logical micro-partition, the basic unit of iterative partitioning. Taking the first Block as an example:
The logical micro-partitions obtained from the first FG-Block are:
(Spark,2)(is,2)(a,2)(fast,1)(and,1)(general-purpose,1)(engine,1)(for,1)(large-scale,1)(data,1)(processing.,1)
The logical micro-partitions obtained from the second FG-Block are:
(Spark,1)(runs,1)(programs,1)(faster,1)(than,1)(Hadoop,1)(MapReduce,1)(in,1)(memory,1)(and,1)(on,1)(disk.,1)
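Merging same-Key tuples of one FG-Block (here, one line of text) into logical micro-partitions amounts to a per-line word count, as this sketch shows:

```python
from collections import Counter

def micro_partitions(line: str) -> Counter:
    """Merge tuples with identical Keys from one FG-Block (one line of
    text) into logical micro-partitions, as in the WordCount example."""
    return Counter(line.split())

mp = micro_partitions("Spark is a fast and Spark is a general-purpose "
                      "engine for large-scale data processing.")
assert mp["Spark"] == 2 and mp["is"] == 2 and mp["a"] == 2
```

The resulting counts match the first FG-Block's micro-partitions listed above, with each count serving as the tuple factor.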
(1.3) Create logical micro-partition indexes
An index is created from the Key value of each logical micro-partition; the index is the structure used to record which Reducer a logical micro-partition will be assigned to;
(1.4) Design the Bucket structure
A number of Buckets equal to the number of Reducers is created; since there are 4 Reducers here, 4 Buckets are created. Fig. 2 shows the Bucket and Segment storage structures: a Bucket contains multiple elements, each element stores the index of a logical micro-partition together with a pointer to a Segment, and the number of Segments is determined by the number of micro-partitions assigned to that Bucket;
(1.5) Create the logical micro-partition vector
The initial α of the first FG-Block of Block1 is:
The first row holds the index values of the logical micro-partitions, e.g. here spark corresponds to 0 and is corresponds to 1; the second row holds the tuple factors, i.e. the number of occurrences of each key; the third row is the Bucket to which each micro-partition is assigned, where -1 means unassigned; and the fourth row is the Segment value of each micro-partition within its Bucket, where -1 means no corresponding Segment has been created;
(2) Determine the timing and quantity of iterative data partitioning
When the Mapper finishes processing an FG-Block and the logical micro-partition indexes have been created, data allocation for the current iteration begins; the number of partitions is set to 4;
(3) Determine the criterion of iterative data partitioning
The criterion adopted for selecting logical micro-partitions is: micro-partitions with large tuple factors are selected first. The unallocated micro-partitions are sorted in descending order; since it was determined above that 4 micro-partitions are allocated per round, the 4 micro-partitions with the largest tuple factors are selected from the sorted list for allocation;
(4) Record the local and global data distribution
(4.1) Record the local data distribution
The logical micro-partition vector α records the tuple factors of all logical micro-partitions in the current iteration round of one Mapper and thus describes that Mapper's data partitioning in the current iteration;
The initial α of the first FG-Block of Block1 is:
The initial α of the first FG-Block of Block2 is:
The initial α of the first FG-Block of Block3 is:
The initial α of the first FG-Block of Block4 is:
(4.2) Record the global data distribution
The corresponding components of the α vectors of all Mappers are accumulated to obtain the global data distribution, represented by the two-dimensional vector AM; at this point,
The initial value of the data allocation vector AB is:
The total number of tuples allocated to each Bucket is obtained by summing, for each row of AB, the tuple factors corresponding to its index values; at this point the tuple totals of the Buckets are (0,0,0,0), indicating that no logical micro-partition has yet been assigned to any Bucket;
(5) Apply the data balancing partition algorithm
Step a1: As described above, the global data AM to be allocated and the allocated data vector AB have been obtained:
Step a2: Sort the unallocated logical micro-partitions in AM in descending order:
Step a3: The allocated data volumes of the Buckets are currently (0,0,0,0). Sort the Buckets in ascending order of allocated data;
Step a4: Take the first 4 unallocated logical micro-partitions from AM and assign them to the first 4 Buckets, obtaining:
The allocated data volumes of the Buckets are now (5,4,2,2);
Step a5: Since unprocessed FG-Blocks remain, this round of the algorithm ends;
The following operations are then performed for the unprocessed FG-Blocks:
Step a1: Obtain the logical micro-partition vector α of the next FG-Block of each Block and add it into the global vector AM, obtaining:
Step a2: Sort the unallocated logical micro-partitions in AM in descending order:
Step a3: The allocated data volumes of the Buckets are (5,8,5,3). Sort the Buckets in ascending order of data volume;
Step a4: Take the first 4 unallocated logical micro-partitions from AM and assign them to the first 4 Buckets, obtaining:
The allocated data volumes of the Buckets are now (7,10,7,6);
Step a5: Since no Block has any unprocessed FG-Block left, go to step a6;
Step a6: From AB, the allocated data volumes of the Buckets are (7,10,7,6). Sort the Buckets in ascending order of data volume;
Step a7: Take the first 4 unallocated logical micro-partitions from AM and assign them to the first 4 Buckets, obtaining:
Step a8: Unallocated logical micro-partitions remain in AM; go to step a6, repeating until no unallocated micro-partition remains in AM;
Finally we obtain:
At this point all logical micro-partitions have been assigned and none remain unallocated; processing ends with the Bucket data volumes allocated as (26,29,26,25);
Finally, the data in each Bucket is sent to its corresponding Reducer, so the data volume of each Reducer is (26,29,26,25); each Reducer counts the keys sent to it, and after counting finishes the final result is obtained. The word counts are:
(UC,1)(approaches,1)(being,1)(cluster,1)(developed,1)(existing,1)
(for,1)(generated,1)(largescale,1)(many,1)(on,1)(perform,1)(released,1)
(several,1)(sources.,1)(that,1)(was,1)(Berkeley's,1)(Data,1)(The,1)(an,1)
(both,1)(different,1)(factors.,1)(framework.,1)(hardware,1)(it,1)(later,1)
(memory,1)(performance,1)(run,1)(soft,1)(such,1)(within,1)(Apache,1)(Big,1)
(be,1)(defined,1)(engine,1)(faster,1)(generalpurpose,1)(in,1)(large,1)
(managers.,1)(of,1)(over,1)(programs,1)(sets,1)(source.,1)(than,1)(use,1)
(AMPLab,1)(Hadoop,1)(analytics.,1)(dataset,1)(disk.,1)(fast,1)(from,1)
(impacted,1)(itself,,1)(like,1)(model,1)(or,1)(processing.,1)(runs,1)(source,
1)(system,,1)(two,1)(by,2)(open,2)(can,2)(the,2)(are,2)(as,2)(a,2)(MapReduce,
3)(data,3)(is,5)(and,5)(Spark,8)
Under Spark's default partition method, the data volume of each Reducer is (18,33,26,29), and the final word-count result is:
(two,1)(dataset,1)(hardware,1)(runs,1)(Big,1)(fast,1)(managers.,1)
(developed,1)(later,1)(several,1)(analytics.,1)(framework.,1)(over,1)
(performance,1)(model,1)(faster,1)(The,1)(different,1)(than,1)(AMPLab,1)(was,
1)(memory,1)(impacted,1)(perform,1)(sets,1)(in,1)(system,,1)(released,1)
(disk.,1)(defined,1)(for,1)(both,1)(an,1)(itself,,1)(Hadoop,1)
(generalpurpose,1)(approaches,1)(factors.,1)(UC,1)(soft,1)(sources.,1)
(cluster,1)(Apache,1)(Data,1)(engine,1)(from,1)(within,1)(processing.,1)(it,
1)(existing,1)(run,1)(that,1)(source.,1)(on,1)(many,1)(be,1)(source,1)(such,
1)(or,1)(largescale,1)(large,1)(generated,1)(of,1)(like,1)(programs,1)
(Berkeley's,1)(being,1)(use,1)(are,2)(can,2)(a,2)(the,2)(as,2)(open,2)(by,2)
(MapReduce,3)(data,3)(is,5)(and,5)(Spark,8)
Define the data skewness K as the ratio of the largest Reducer data volume to the smallest.
Spark's default partitioning yields Reducer data volumes of (18,33,26,29), while the data balancing optimization method proposed here yields (26,29,26,25). The data skewness of each method is computed from its Reducer data volumes:
K_Spark-default = 33/18 ≈ 1.83
K_optimized = 29/25 ≈ 1.16
Comparing the two K values, the proposed iterative data balancing optimization method improves on Spark's default partition method by 57.7% with respect to data skew. This shows that the proposed iterative data balancing optimization method reduces data skew and makes the load more balanced.
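The skewness comparison can be reproduced as follows (the 57.7% figure in the text follows from the rounded K values; the unrounded ratio is closer to 58%):

```python
def skewness(volumes):
    """Data skewness K: largest Reducer load divided by the smallest."""
    return max(volumes) / min(volumes)

k_default = skewness([18, 33, 26, 29])    # Spark's default hash partitioning
k_balanced = skewness([26, 29, 26, 25])   # the proposed iterative balancing
assert round(k_default, 2) == 1.83 and round(k_balanced, 2) == 1.16

improvement = (k_default - k_balanced) / k_balanced  # relative skew reduction
assert 0.57 < improvement < 0.59
```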
Claims (1)
1. An iterative data balancing optimization method for the Spark parallel computing framework, comprising the following steps:
(1) Create logical micro-partitions and logical micro-partition indexes
(1.1) Create fine-grained data blocks
In the Spark framework, the default data processing unit is the coarse-grained data block (Block), whose size is typically set to 128 MB; each Block is further subdivided into multiple fine-grained data blocks (Fine-grained Blocks, FG-Blocks), which are then processed iteratively;
(1.2) Create logical micro-partitions
In the Mapper stage, the tuples obtained from processing an FG-Block are stored in a cache (Buffer) and tuples with identical Key values are merged; each merged tuple set is called a logical micro-partition, the basic unit of iterative partitioning;
(1.3) Create logical micro-partition indexes
An index is created from the Key value of each logical micro-partition; the index is the structure used to record which Reducer a logical micro-partition will be assigned to;
(1.4) Design the Bucket structure
A number of Buckets equal to the number of Reducers is created, giving a one-to-one correspondence between Reducers and Buckets; each Bucket is logically divided into multiple Segments, each of which can store only one logical micro-partition; the tuple set obtained in each iteration is called a logical micro-partition iteration block, different iteration blocks are stored in the Slots of a Segment, and multiple logical micro-partitions may be assigned to the same Reducer;
(1.5) Create the logical micro-partition vector
The number of tuples in a logical micro-partition is called its tuple factor; to associate each index with its tuple factor and allocation status, a logical micro-partition vector α recording the indexes, tuple factors, and allocation status is created:
α = (t1, t2, t3, …, tn),
where ti = (index, factor, bno, sno) (i ∈ [1, n]); index is the index value, factor is the tuple factor, bno is the Bucket number, and sno is the Segment number;
(2) Determine the timing and quantity of iterative data partitioning
When the Mapper finishes processing an FG-Block and the logical micro-partition indexes have been created, data allocation for the current iteration begins; the number of partitions is set to the number of Buckets;
When the tuple factor of some logical micro-partition is much larger than those of the other micro-partitions, it is not split; instead, an alternative attribute is combined with the Key to perform the partitioning;
(3) Determine the criterion of iterative data partitioning
When there are many logical micro-partitions, those with large tuple factors are selected first, so that micro-partitions with many tuples are transferred to their designated Reducers as early as possible while those with few tuples are transmitted later; this effectively overlaps computation with data transfer and helps reduce the storage space occupied by the Buckets;
Accordingly, the still-unallocated logical micro-partitions are sorted by tuple factor, and the first several micro-partitions are selected and assigned to Reducers;
(4) Record the local and global data distribution
(4.1) Record the local data distribution
The logical micro-partition vector α records the tuple factors of all logical micro-partitions in the current iteration round of one Mapper and thus describes that Mapper's data partitioning in the current iteration;
(4.2) Record the global data distribution
The corresponding components of the α vectors of all Mappers are accumulated to obtain the global data distribution, represented by the two-dimensional vector AM:
AM = Σ αs (s ∈ [1, m])
At the end of each Mapper iteration, the vector α is sent to the Master and added into AM, yielding the global data distribution up to the current iteration round;
Because Reducers and Buckets correspond one to one, the number of tuples allocated to each Reducer can be measured by counting the tuples in each Bucket; a data allocation vector AB is defined:
AB = (B1, B2, B3, …, Bb)
where Br = (f1, f2, f3, …, fn) (r ∈ [1, b]); this vector records the allocation of the logical micro-partitions to the Buckets: each row represents one Bucket, and the elements of a row are the index values of the micro-partitions assigned to that Bucket;
The total number of tuples allocated to each Bucket is obtained by summing, for each row of AB, the tuple factors corresponding to its index values;
(5) designing the data-balancing partitioning algorithm
The task of the data-balancing partitioning algorithm is, with the future data distribution unknown, to distribute the p logical micro-partitions selected in the current iteration among the b Buckets according to the logical micro-partitions already assigned to each Bucket and their tuple counts, so that the data volumes of the Buckets after assignment are as balanced as possible;
The data distribution is realized by the following algorithm:
Step a1: accumulate the logical micro-partition vectors α of the FG-Block to obtain AM;
Step a2: sort the unallocated logical micro-partitions in AM from largest to smallest;
Step a3: sort the Buckets from smallest to largest by allocated data volume;
Step a4: take the first p unallocated logical micro-partitions of AM and assign them to the first b Buckets;
Step a5: if there is a next FG-Block, terminate; otherwise, go to step a6;
Step a6: sort the Buckets from smallest to largest by allocated data volume;
Step a7: take the first p unallocated logical micro-partitions of AM and assign them to the first b Buckets;
Step a8: if unallocated logical micro-partitions remain in AM, go to step a6; otherwise, terminate;
By iteratively processing the multiple fine-grained data blocks (FG-Blocks) obtained from the subdivision described above, the balance of the data assigned to the Reducers is optimized, data skew is reduced, and the data processing performance of Spark is improved.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710623289.6A CN107506388A (en) | 2017-07-27 | 2017-07-27 | A kind of iterative data balancing optimization method towards Spark parallel computation frames |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710623289.6A CN107506388A (en) | 2017-07-27 | 2017-07-27 | A kind of iterative data balancing optimization method towards Spark parallel computation frames |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107506388A true CN107506388A (en) | 2017-12-22 |
Family
ID=60690119
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710623289.6A Pending CN107506388A (en) | 2017-07-27 | 2017-07-27 | A kind of iterative data balancing optimization method towards Spark parallel computation frames |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107506388A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108572873A (en) * | 2018-04-24 | 2018-09-25 | 中国科学院重庆绿色智能技术研究院 | A kind of load-balancing method and device solving the problems, such as Spark data skews |
CN110264722A (en) * | 2019-07-03 | 2019-09-20 | 泰华智慧产业集团股份有限公司 | The screening technique and system of warping apparatus in information collecting device |
CN110309177A (en) * | 2018-03-23 | 2019-10-08 | 腾讯科技(深圳)有限公司 | A kind of method and relevant apparatus of data processing |
CN112000467A (en) * | 2020-07-24 | 2020-11-27 | 广东技术师范大学 | Data tilt processing method and device, terminal equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104598567A (en) * | 2015-01-12 | 2015-05-06 | 北京中交兴路车联网科技有限公司 | Data statistics and de-duplication method based on Hadoop MapReduce programming frame |
US20150378696A1 (en) * | 2014-06-27 | 2015-12-31 | International Business Machines Corporation | Hybrid parallelization strategies for machine learning programs on top of mapreduce |
CN106126343A (en) * | 2016-06-27 | 2016-11-16 | 西北工业大学 | MapReduce data balancing method based on increment type partitioning strategies |
CN106502790A (en) * | 2016-10-12 | 2017-03-15 | 山东浪潮云服务信息科技有限公司 | A kind of task distribution optimization method based on data distribution |
- 2017-07-27 CN CN201710623289.6A patent/CN107506388A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150378696A1 (en) * | 2014-06-27 | 2015-12-31 | International Business Machines Corporation | Hybrid parallelization strategies for machine learning programs on top of mapreduce |
CN104598567A (en) * | 2015-01-12 | 2015-05-06 | 北京中交兴路车联网科技有限公司 | Data statistics and de-duplication method based on Hadoop MapReduce programming frame |
CN106126343A (en) * | 2016-06-27 | 2016-11-16 | 西北工业大学 | MapReduce data balancing method based on increment type partitioning strategies |
CN106502790A (en) * | 2016-10-12 | 2017-03-15 | 山东浪潮云服务信息科技有限公司 | A kind of task distribution optimization method based on data distribution |
Non-Patent Citations (2)
Title |
---|
Bian Chen: "Partition mapping algorithm of in-memory computing framework based on iterative progressive filling", Journal of Computer Applications (《计算机应用》) * |
Wang Zhuo: "MapReduce data balancing method based on incremental partitioning strategy", Chinese Journal of Computers (《计算机学报》) * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110309177A (en) * | 2018-03-23 | 2019-10-08 | 腾讯科技(深圳)有限公司 | A kind of method and relevant apparatus of data processing |
CN110309177B (en) * | 2018-03-23 | 2023-11-03 | 腾讯科技(深圳)有限公司 | Data processing method and related device |
CN108572873A (en) * | 2018-04-24 | 2018-09-25 | 中国科学院重庆绿色智能技术研究院 | A kind of load-balancing method and device solving the problems, such as Spark data skews |
CN108572873B (en) * | 2018-04-24 | 2021-08-24 | 中国科学院重庆绿色智能技术研究院 | Load balancing method and device for solving Spark data inclination problem |
CN110264722A (en) * | 2019-07-03 | 2019-09-20 | 泰华智慧产业集团股份有限公司 | The screening technique and system of warping apparatus in information collecting device |
CN112000467A (en) * | 2020-07-24 | 2020-11-27 | 广东技术师范大学 | Data tilt processing method and device, terminal equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10218808B2 (en) | Scripting distributed, parallel programs | |
Gautam et al. | A survey on job scheduling algorithms in big data processing | |
CN104281652B (en) | Strong point data partition method one by one in metric space | |
CN107506388A (en) | A kind of iterative data balancing optimization method towards Spark parallel computation frames | |
CN104820708B (en) | A kind of big data clustering method and device based on cloud computing platform | |
US10901800B2 (en) | Systems for parallel processing of datasets with dynamic skew compensation | |
CN104809244B (en) | Data digging method and device under a kind of big data environment | |
Tang et al. | An intermediate data partition algorithm for skew mitigation in spark computing environment | |
Zhang et al. | Dart: A geographic information system on hadoop | |
Ibrahim et al. | Improvement of job completion time in data-intensive cloud computing applications | |
Kumar et al. | Big data streaming platforms: A review | |
CN109471877B (en) | Incremental temporal frequent pattern parallel mining method facing streaming data | |
Wang et al. | Fine-grained probability counting for cardinality estimation of data streams | |
Shin et al. | Cocos: Fast and accurate distributed triangle counting in graph streams | |
Aslam et al. | Pre‐filtering based summarization for data partitioning in distributed stream processing | |
Belcastro et al. | Evaluation of large scale roi mining applications in edge computing environments | |
Yu et al. | A MapReduce reinforced distributed sequential pattern mining algorithm | |
Hilbrich et al. | Order preserving event aggregation in TBONs | |
Abubaker et al. | Minimizing staleness and communication overhead in distributed SGD for collaborative filtering | |
Singh et al. | Estimating quantiles from the union of historical and streaming data | |
Wu et al. | Accelerating real-time tracking applications over big data stream with constrained space | |
Wang et al. | Skew‐aware online aggregation over joins through guided sampling | |
Li et al. | DSS: a scalable and efficient stratified sampling algorithm for large-scale datasets | |
Feng et al. | The edge weight computation with mapreduce for extracting weighted graphs | |
Wu et al. | Real-Time Search Method for Large-Scale Regional Targets Based on Parallel Google S2 Algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20171222 |