CN107506388A - Iterative data balancing optimization method for the Spark parallel computing framework - Google Patents

Iterative data balancing optimization method for the Spark parallel computing framework

Info

Publication number
CN107506388A
CN107506388A
Authority
CN
China
Prior art keywords
partition
data
bucket
micro-partition
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710623289.6A
Other languages
Chinese (zh)
Inventor
张元鸣
蒋建波
黄浪游
沈志鹏
项倩红
肖刚
陆佳炜
高飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201710623289.6A priority Critical patent/CN107506388A/en
Publication of CN107506388A publication Critical patent/CN107506388A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02: Addressing or allocation; Relocation
    • G06F12/0223: User address space allocation, e.g. contiguous or non-contiguous base addressing
    • G06F12/023: Free address space management
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/22: Indexing; Data structures therefor; Storage structures
    • G06F16/2228: Indexing structures
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061: Partitioning or combining of resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An iterative data balancing partition method for the Spark parallel computing framework. First, coarse-grained big-data Blocks are subdivided into fine-grained FG-Blocks, and micro-partitions and micro-partition indexes are created from the FG-Blocks. Second, as many Buckets are created as there are Reducers. Third, the timing, quantity, and criterion of iterative data partitioning are determined. Fourth, the local and global data distribution of each Bucket is recorded. Fifth, the selected micro-partitions are assigned to the Buckets according to a data balancing partitioning algorithm and the recorded distribution. Finally, the data allocated to the Buckets are transferred to the Reducer side. The invention proposes a new data balancing partition method for the Spark framework that reduces data skew during big data processing and improves the overall big data processing performance of the Spark parallel computing framework.

Description

Iterative data balancing optimization method for the Spark parallel computing framework
Technical field
The present invention relates to big data processing and high-performance computing, and in particular proposes an iterative data balancing optimization method for the Spark parallel computing framework.
Background technology
MapReduce is a parallel computing model for big data processing proposed by Google in 2004. By running multiple tasks concurrently on large numbers of inexpensive cluster nodes, it processes massive data in parallel and improves processing performance; over the past decade it has developed rapidly and been widely adopted. Spark is a parallel computing framework based on MapReduce, developed in 2009 at the AMPLab of the University of California, Berkeley. It retains the advantages of MapReduce while keeping intermediate task results in memory, which reduces disk I/O overhead and improves big data processing performance; it has become the mainstream framework for building big data processing platforms.
Data imbalance during big data processing, also known as data skew, is an important bottleneck that degrades the overall performance of the Spark framework. Experimental results by Lin J (Proceedings of the 7th Workshop on Large-Scale Distributed Systems for Information Retrieval, 2009) show that under the current default Hash partitioning method, up to 92% of Reducers exhibit data imbalance, reducing overall performance by 22% to 38%.
For the data skew problem in big data processing, researchers at home and abroad have proposed different methods, which can be roughly grouped into the following classes. (1) Two-phase data partitioning: proposed by Gufler et al. (Proceedings of the 1st International Conference on Cloud Computing and Services Science, 2011), this method first generates partitions with a one-shot partitioning method and analyzes the partitioning at runtime; if skew occurs, partitions with large data volumes are split and the split data are reassigned to partitions with small data volumes, achieving balanced partitioning. However, its effectiveness depends on the timing of the adjustment: splitting a large partition too early increases the chance of mistaken splits, while splitting it too late delays data transfer. (2) Multi-stage data partitioning: proposed by Wang Zhuo et al. (Chinese Journal of Computers, 2016), this method generates fine-grained partitions in the Map stage, then assesses partition balance at runtime with a defined cost model and assigns the selected fine-grained partitions to Reducers when a certain condition is met; through multi-stage screening and assignment, a degree of balance is reached. However, the timing of partitioning is hard to pin down, so the method lacks generality. (3) Sampling-based data partitioning: proposed by Ramakrishnan et al. (Proceedings of the 3rd ACM Symposium on Cloud Computing, 2012), this method combines sampling with data splitting; an extra process is added during execution to analyze the data distribution, and once a certain proportion of the data has been processed, partitions are split and merged according to the sampling analysis, i.e. partitions with large data volumes are split and merged with partitions with small data volumes. However, collecting the distribution requires extra overhead, adding data access and transfer costs, and sampling carries uncertainty: too little sampling gives insufficient accuracy, while too much adds further overhead. (4) Delayed data partitioning: proposed by Kwon et al. (Proceedings of the 1st ACM Symposium on Cloud Computing, 2010), this method estimates partition sizes with a defined cost model and starts data partitioning only after the task has run to a certain point. But data transfer cannot begin until partitioning completes, so transfer is delayed and large amounts of data must wait at the Mapper side; data processing and data transfer cannot proceed simultaneously. (5) Migration-based data partitioning: also proposed by Kwon et al. (Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 2012), this method does not aim at balanced data partitioning but balances node load through data migration: a cost model estimates the remaining cost of unfinished Reducer tasks and, when certain conditions are met, migrates the unprocessed data of such nodes to nodes whose tasks have completed, balancing task execution across nodes. But this adds extra data transfer cost and to some extent delays job completion.
Summary of the invention
To overcome the data skew problem in big data processing, the present invention proposes an iterative data balancing partition method for the Spark parallel computing framework, which assigns big data partitions to Reducers iteratively so that the amount of data processed on each Reducer is globally balanced, improving the overall big data processing performance of the Spark parallel computing framework.
An iterative data balancing partition method for the Spark parallel computing framework comprises the following steps:
(1) Creating micro-partitions and micro-partition indexes
(1.1) Creating fine-grained data blocks
In the Spark framework, the default unit of data processing is the coarse-grained data block (Block), whose size is typically set to 128 MB; each Block is further subdivided into multiple fine-grained data blocks (Fine-grained Blocks, FG-Blocks), which are then processed iteratively;
(1.2) Creating micro-partitions
In the Mapper stage, the tuples produced by processing an FG-Block are stored in a cache (Buffer), and tuples with identical Key values are merged; each merged tuple set is called a micro-partition, the basic unit of iterative partitioning;
(1.3) Creating micro-partition indexes
An index is created from the Key value of each micro-partition; the index is the structure used to record which Reducer a micro-partition will be assigned to;
(1.4) Designing the Bucket structure
As many Buckets are created as there are Reducers, so that Reducers and Buckets correspond one-to-one; a Bucket is logically divided into multiple Segments, each of which can store only one micro-partition; the tuple set obtained in each iteration is called a micro-partition iteration block, different iteration blocks are stored in the Slots of a Segment, and multiple micro-partitions can be assigned to the same Reducer;
(1.5) Creating the micro-partition vector
The number of tuples in a micro-partition is called its tuple factor; to associate indexes with tuple factors and allocation state, a micro-partition vector α is created that records each index together with its tuple factor and allocation state:
α = (t1, t2, t3, …, tn),
where ti = (index, factor, bno, sno) (i ∈ [1, n]); index is the index value, factor the tuple factor, bno the Bucket number, and sno the Segment number;
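As an illustration, one component ti of the vector α can be rendered as a small record. This is a hypothetical Python sketch (the class name MicroPartitionEntry is invented here), with -1 marking an unassigned Bucket or uncreated Segment, consistent with the quadruple above:

```python
from dataclasses import dataclass

# Hypothetical sketch of one component t_i of the micro-partition
# vector α; -1 in bno/sno means "not yet assigned / not yet created".
@dataclass
class MicroPartitionEntry:
    index: int     # index value of the micro-partition's Key
    factor: int    # tuple factor: number of tuples with this Key
    bno: int = -1  # Bucket number the micro-partition is assigned to
    sno: int = -1  # Segment number inside that Bucket

# α for two micro-partitions, e.g. "spark" (index 0) and "is" (index 1)
alpha = [MicroPartitionEntry(0, 2), MicroPartitionEntry(1, 2)]
assert all(t.bno == -1 for t in alpha)  # nothing assigned yet
```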
(2) Determining the timing and quantity of iterative data partitioning
When a Mapper finishes processing an FG-Block and the micro-partition indexes have been created, the data allocation of the current iteration begins; the number of partitions allocated per iteration is set to the number of Buckets;
When the tuple factor of some micro-partition is much larger than those of the others, it is not split; in this case, partitioning is performed using an alternative attribute as the Key;
(3) Determining the criterion of iterative data partitioning
When there are many micro-partitions, those with large tuple factors are selected first, so that micro-partitions with many tuples are transferred to their Reducers as early as possible while micro-partitions with few tuples are transmitted later; this lets computation overlap effectively with data transfer and helps reduce the storage space occupied by Buckets;
Accordingly, the still-unallocated micro-partitions are ordered by tuple factor, and the first several micro-partitions are selected and assigned to Reducers;
(4) Recording the local and global data distribution
(4.1) Recording the local data distribution
The micro-partition vector α records the tuple factors of all micro-partitions of a Mapper in the current iteration round, and thus describes that Mapper's data partitioning in the current iteration;
(4.2) Recording the global data distribution
The corresponding components of the α vectors on all Mappers are accumulated to obtain the global data distribution, represented by a two-dimensional vector AM:
AM = Σ αs (s ∈ [1, m])
At the end of each Mapper iteration, the micro-partition vector α is sent to the Master and added into AM, yielding the global data distribution up to the current iteration round;
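A minimal sketch of this accumulation, under the assumption that each α is represented as a mapping from a micro-partition key to its tuple factor (the names alphas and AM below are illustrative):

```python
from collections import defaultdict

# Two Mappers report their micro-partition vectors α at the end of an
# iteration; the Master accumulates matching components into AM.
alphas = [{"spark": 2, "is": 2, "a": 2},   # α from Mapper 1
          {"spark": 1, "runs": 1}]          # α from Mapper 2
AM = defaultdict(int)
for alpha in alphas:
    for key, factor in alpha.items():
        AM[key] += factor                   # AM = Σ α_s
assert AM["spark"] == 3  # components with the same key are summed
```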
Because Reducers and Buckets correspond one-to-one, the number of tuples allocated to each Reducer can be measured by counting the tuples in each Bucket; a data allocation vector AB is defined:
AB = (B1,B2,B3,…,Bb)
where Br = (f1,f2,f3,…,fn) (r ∈ [1, b]); this vector records the allocation of micro-partitions across Buckets: each row represents one Bucket, and the elements of a row are the index values of the micro-partitions assigned to that Bucket;
The total number of tuples allocated to each Bucket is obtained by summing, for each row of AB, the tuple factors corresponding to the row's index values;
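For illustration, these row sums can be sketched as follows; the state of AM and AB here is hypothetical:

```python
# Hypothetical state: AB rows hold the index values of the
# micro-partitions assigned to each of 4 Buckets, and AM maps an
# index to its tuple factor.
AM = {0: 5, 1: 4, 2: 2, 3: 2}
AB = [[0], [1], [2], [3]]  # one micro-partition per Bucket so far
# Per-Bucket tuple totals: sum the tuple factors of each row's indexes
bucket_totals = [sum(AM[idx] for idx in row) for row in AB]
assert bucket_totals == [5, 4, 2, 2]
```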
(5) Designing the data balancing partitioning algorithm
The task of the data balancing partitioning algorithm is, with the future data distribution unknown, to assign the p micro-partitions selected in the current iteration to the b Buckets according to the micro-partitions already assigned to each Bucket and their tuple quantities, keeping the data volumes of the Buckets as balanced as possible after allocation;
The data allocation algorithm is as follows:
Step a1: accumulate the micro-partition vectors α of the FG-Blocks into AM;
Step a2: sort the unallocated micro-partitions in AM in descending order;
Step a3: sort the Buckets in ascending order of allocated data;
Step a4: take the first p unallocated micro-partitions of AM and assign them to the first b Buckets;
Step a5: if there is a next FG-Block, stop and wait for it; otherwise go to step a6;
Step a6: sort the Buckets in ascending order of allocated data;
Step a7: take the first p unallocated micro-partitions of AM and assign them to the first b Buckets;
Step a8: if unallocated micro-partitions remain in AM, go to step a6; otherwise stop;
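Steps a1 to a8 can be sketched in Python under stated assumptions: each FG-Block is given as a dictionary mapping a micro-partition key to its tuple factor, and the function name balance_partitions is invented for illustration. This is a sketch of the procedure, not the patented implementation:

```python
def balance_partitions(blocks, b, p):
    """Sketch of steps a1-a8. blocks: one {key: tuple_factor} dict per
    FG-Block; b: number of Buckets; p: micro-partitions per round."""
    AM = {}          # global key -> tuple factor (step a1 accumulates here)
    assigned = {}    # key -> Bucket index
    loads = [0] * b  # data already allocated to each Bucket (from AB)

    def allocate_round():
        # a2/a7: unallocated micro-partitions, largest tuple factor first
        todo = sorted((k for k in AM if k not in assigned),
                      key=lambda k: -AM[k])[:p]
        # a3/a6: Buckets in ascending order of allocated data
        order = sorted(range(b), key=lambda i: loads[i])
        for key, bucket in zip(todo, order):   # a4/a7: pair them up
            assigned[key] = bucket
            loads[bucket] += AM[key]

    for block in blocks:                  # one iteration per FG-Block
        for key, factor in block.items():
            AM[key] = AM.get(key, 0) + factor  # a1: α accumulated into AM
        allocate_round()                  # a5: then wait for next FG-Block
    while len(assigned) < len(AM):        # a6-a8: drain what remains
        allocate_round()
    return assigned, loads

assigned, loads = balance_partitions(
    [{"a": 5, "b": 4, "c": 2, "d": 2}, {"e": 3, "f": 1}], b=2, p=2)
assert sum(loads) == 17 and max(loads) - min(loads) <= 1
```

The greedy pairing of the largest remaining micro-partitions with the lightest Buckets is what keeps the loads close after every round.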
Iterating in this way over the fine-grained data blocks FG-Blocks produced by subdivision brings the data assigned to the Reducers to an optimal balance, reducing data skew and improving Spark's data processing performance.
The advantages of the invention are:
For the data skew problem in big data processing, the invention optimizes the allocation process of the Spark parallel computing framework and achieves balanced allocation of big data: coarse-grained data blocks are subdivided and the fine-grained blocks are processed iteratively; micro-partitions and micro-partition indexes are built from Key values; the timing, condition, criterion, and quantity of micro-partition allocation are determined; and the micro-partitions of each iteration round are assigned incrementally to the Reducer side, balancing the big data partitions overall and improving Spark's overall big data processing performance.
Brief description of the drawings
Fig. 1 shows the overall structure of the iterative data balancing partitioning of the invention;
Fig. 2 shows the Bucket and Segment storage structures of the invention;
Fig. 3 shows the micro-partition data allocation process of the invention over two iterations.
Embodiment
Taking the classic big-data word count program WordCount as an example, an embodiment of the invention is further described with reference to Figs. 1-3:
Suppose the WordCount program is to count words in the contents of 4 Blocks, assigned to 4 nodes; each Block contains 2 lines of data, as follows:
Block1:
Spark is a fast and Spark is a general-purpose engine for large-scale data processing.
Spark runs programs faster than Hadoop MapReduce in memory and on disk.
Block2:
Spark performance is impacted by many soft system,hardware and dataset factors.
Spark can run both by itself,or over several existing cluster managers.
Block3:
Big Data can be defined as large data sets are being generated from different sources.
The use of the MapReduce and Spark are two approaches perform data analytics.
Block4:
Apache Spark is like the MapReduce model such that it is an open source framework.
Spark was developed within UC Berkeley's AMPLab and later released as open source.
(1) Creating micro-partitions and micro-partition indexes
(1.1) Creating fine-grained data blocks
Each Block is subdivided at fine granularity; assuming each line is one FG-Block, subdivision yields 2 fine-grained data blocks per Block;
(1.2) Creating micro-partitions
In the Mapper stage, the tuples produced by processing each line of words are stored in the cache Buffer, and tuples with identical Key values are merged; each merged tuple set is a micro-partition, the basic unit of iterative partitioning. Taking the first Block as an example:
The micro-partitions obtained from the first FG-Block are:
(Spark,2)(is,2)(a,2)(fast,1)(and,1)(general-purpose,1)(engine,1)(for, 1)(large-scale,1)(data,1)(processing.,1)
The micro-partitions obtained from the second FG-Block are:
(Spark,1)(runs,1)(programs,1)(faster,1)(than,1)(Hadoop,1)(MapReduce, 1)(in,1)(memory,1)(and,1)(on,1)(disk.,1)
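The merge that produces these micro-partitions amounts to per-line word counting; a minimal sketch using Python's Counter as a stand-in for the Buffer merge:

```python
from collections import Counter

# First FG-Block (first line of Block1); merging Key-equal tuples in
# the Buffer amounts to counting each word's occurrences.
line = ("Spark is a fast and Spark is a general-purpose "
        "engine for large-scale data processing.")
micro_partitions = Counter(line.split())
assert micro_partitions["Spark"] == 2 and micro_partitions["fast"] == 1
```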
(1.3) Creating micro-partition indexes
An index is created from the Key value of each micro-partition; the index records which Reducer a micro-partition will be assigned to;
(1.4) Designing the Bucket structure
As many Buckets are created as there are Reducers; since there are 4 Reducers, 4 Buckets are created. Fig. 2 shows the Bucket and Segment storage structures: a Bucket contains multiple elements, each element stores a micro-partition index whose pointer points to a Segment, and the number of Segments is determined by the number of micro-partitions assigned to that Bucket;
(1.5) Creating the micro-partition vector
The initial α of, for example, the first FG-Block of Block1 is:
The first row holds the index values of the micro-partitions (here, e.g., spark corresponds to 0 and is to 1); the second row holds the tuple factors, i.e. the number of occurrences of each key; the third row holds the Bucket each micro-partition is assigned to, with -1 meaning unassigned; the fourth row holds the micro-partition's Segment value on the Bucket, with -1 meaning the corresponding Segment has not been created;
(2) Determining the timing and quantity of iterative data partitioning
When the Mapper finishes processing an FG-Block and the micro-partition indexes have been created, the data allocation of the current iteration begins; the number of partitions is set to 4;
(3) Determining the criterion of iterative data partitioning
The criterion for selecting micro-partitions is: select those with the largest tuple factors first. The unallocated micro-partitions are sorted in descending order; since 4 are allocated per round, as determined above, the 4 micro-partitions with the largest tuple factors are selected from the sorted list for allocation;
(4) Recording the local and global data distribution
(4.1) Recording the local data distribution
The micro-partition vector α records the tuple factors of all micro-partitions of a Mapper in the current iteration round, and thus describes that Mapper's data partitioning in the current iteration;
The initial α of Block1's first FG-Block is:
The initial α of Block2's first FG-Block is:
The initial α of Block3's first FG-Block is:
The initial α of Block4's first FG-Block is:
(4.2) Recording the global data distribution
The corresponding components of the α vectors on all Mappers are accumulated to obtain the global data distribution, represented by the two-dimensional vector AM; at this point,
The initial value of the data allocation vector AB is:
The total number of tuples allocated to each Bucket is obtained by summing, per row of AB, the tuple factors of the row's index values; at this point each Bucket's tuple total is (0, 0, 0, 0), meaning no micro-partition has yet been assigned to any Bucket;
(5) Running the data balancing partitioning algorithm
Step a1: from the above, the global to-be-allocated data AM and the allocated-data vector AB have been obtained:
Step a2: sort the unallocated micro-partitions in AM in descending order:
Step a3: each Bucket's allocated data volume is currently (0, 0, 0, 0); sort the Buckets in ascending order of allocated data;
Step a4: take the first 4 unallocated micro-partitions of AM and assign them to the first 4 Buckets, giving:
Each Bucket's allocated data volume is now (5, 4, 2, 2);
Step a5: since unprocessed FG-Blocks remain, this round of the algorithm ends;
The following is then done for the unprocessed FG-Blocks:
Step a1: obtain the micro-partition vector α of each Block's next FG-Block and add it to the global vector AM, giving:
Step a2: sort the unallocated micro-partitions in AM in descending order:
Step a3: each Bucket's allocated data volume is (5, 8, 5, 3); sort the Buckets in ascending order of data volume;
Step a4: take the first 4 unallocated micro-partitions of AM and assign them to the first 4 Buckets, giving:
Each Bucket's allocated data volume is now (7, 10, 7, 6);
Step a5: since no Block has an unprocessed FG-Block left, go to step a6;
Step a6: from AB, each Bucket's allocated data volume is (7, 10, 7, 6); sort the Buckets in ascending order of data volume;
Step a7: take the first 4 unallocated micro-partitions of AM and assign them to the first 4 Buckets, giving:
Step a8: unallocated micro-partitions remain in AM; go to step a6 and repeat until none remain;
The final result is:
All micro-partitions are now assigned, with none unallocated; processing ends with the Bucket data volumes at (26, 29, 26, 25);
Finally the Bucket data are sent to the corresponding Reducers, whose data volumes are (26, 29, 26, 25); each Reducer counts the keys assigned to it, and after counting the final word count result is:
(UC,1)(approaches,1)(being,1)(cluster,1)(developed,1)(existing,1) (for,1)(generated,1)(largescale,1)(many,1)(on,1)(perform,1)(released,1) (several,1)(sources.,1)(that,1)(was,1)(Berkeley's,1)(Data,1)(The,1)(an,1) (both,1)(different,1)(factors.,1)(framework.,1)(hardware,1)(it,1)(later,1) (memory,1)(performance,1)(run,1)(soft,1)(such,1)(within,1)(Apache,1)(Big,1) (be,1)(defined,1)(engine,1)(faster,1)(generalpurpose,1)(in,1)(large,1) (managers.,1)(of,1)(over,1)(programs,1)(sets,1)(source.,1)(than,1)(use,1) (AMPLab,1)(Hadoop,1)(analytics.,1)(dataset,1)(disk.,1)(fast,1)(from,1) (impacted,1)(itself,,1)(like,1)(model,1)(or,1)(processing.,1)(runs,1)(source, 1)(system,,1)(two,1)(by,2)(open,2)(can,2)(the,2)(are,2)(as,2)(a,2)(MapReduce, 3)(data,3)(is,5)(and,5)(Spark,8)
Under Spark's default partitioning method, the Reducer data volumes are (18, 33, 26, 29), and the final word count result is:
(two,1)(dataset,1)(hardware,1)(runs,1)(Big,1)(fast,1)(managers.,1) (developed,1)(later,1)(several,1)(analytics.,1)(framework.,1)(over,1) (performance,1)(model,1)(faster,1)(The,1)(different,1)(than,1)(AMPLab,1)(was, 1)(memory,1)(impacted,1)(perform,1)(sets,1)(in,1)(system,,1)(released,1) (disk.,1)(defined,1)(for,1)(both,1)(an,1)(itself,,1)(Hadoop,1) (generalpurpose,1)(approaches,1)(factors.,1)(UC,1)(soft,1)(sources.,1) (cluster,1)(Apache,1)(Data,1)(engine,1)(from,1)(within,1)(processing.,1)(it, 1)(existing,1)(run,1)(that,1)(source.,1)(on,1)(many,1)(be,1)(source,1)(such, 1)(or,1)(largescale,1)(large,1)(generated,1)(of,1)(like,1)(programs,1) (Berkeley's,1)(being,1)(use,1)(are,2)(can,2)(a,2)(the,2)(as,2)(open,2)(by,2) (MapReduce,3)(data,3)(is,5)(and,5)(Spark,8)
Denote the data skew ratio by K.
Spark's default partitioning yields Reducer data volumes of (18, 33, 26, 29), while the data balancing optimization method proposed here yields (26, 29, 26, 25); the data skew ratio of each method is computed from its Reducer data volumes:
K(Spark default) = 33/18 ≈ 1.83
K(Spark optimized) = 29/25 ≈ 1.16
Comparing the two K values, the iterative data balancing optimization method proposed here improves on Spark's default partitioning method by 57.7% in terms of data skew. The method thus reduces data skew and makes the load more balanced.
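The comparison above can be reproduced in a few lines, taking K, as the figures imply, to be the largest Reducer data volume divided by the smallest:

```python
def skew_ratio(loads):
    # Data skew ratio K: heaviest Reducer load over lightest
    return max(loads) / min(loads)

k_default = skew_ratio([18, 33, 26, 29])    # Spark default partitioning
k_balanced = skew_ratio([26, 29, 26, 25])   # the method described above
assert round(k_default, 2) == 1.83
assert round(k_balanced, 2) == 1.16
# relative improvement: (1.83 - 1.16) / 1.16 ≈ 57.7%
```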

Claims (1)

1. An iterative data balancing optimization method for the Spark parallel computing framework, comprising the following steps:
(1) Creating micro-partitions and micro-partition indexes
(1.1) Creating fine-grained data blocks
In the Spark framework, the default unit of data processing is the coarse-grained data block (Block), whose size is typically set to 128 MB; each Block is further subdivided into multiple fine-grained data blocks (Fine-grained Blocks, FG-Blocks), which are then processed iteratively;
(1.2) Creating micro-partitions
In the Mapper stage, the tuples produced by processing an FG-Block are stored in a cache (Buffer), and tuples with identical Key values are merged; each merged tuple set is called a micro-partition, the basic unit of iterative partitioning;
(1.3) Creating micro-partition indexes
An index is created from the Key value of each micro-partition; the index is the structure used to record which Reducer a micro-partition will be assigned to;
(1.4) Designing the Bucket structure
As many Buckets are created as there are Reducers, so that Reducers and Buckets correspond one-to-one; a Bucket is logically divided into multiple Segments, each of which can store only one micro-partition; the tuple set obtained in each iteration is called a micro-partition iteration block, different iteration blocks are stored in the Slots of a Segment, and multiple micro-partitions can be assigned to the same Reducer;
(1.5) Creating the micro-partition vector
The number of tuples in a micro-partition is called its tuple factor; to associate indexes with tuple factors and allocation state, a micro-partition vector α is created that records each index together with its tuple factor and allocation state:
α = (t1, t2, t3, …, tn),
where ti = (index, factor, bno, sno) (i ∈ [1, n]); index is the index value, factor the tuple factor, bno the Bucket number, and sno the Segment number;
(2) Determining the timing and quantity of iterative data partitioning
When a Mapper finishes processing an FG-Block and the micro-partition indexes have been created, the data allocation of the current iteration begins; the number of partitions allocated per iteration is set to the number of Buckets;
When the tuple factor of some micro-partition is much larger than those of the others, it is not split; in this case, partitioning is performed using an alternative attribute as the Key;
(3) Determining the criterion of iterative data partitioning
When there are many micro-partitions, those with large tuple factors are selected first, so that micro-partitions with many tuples are transferred to their Reducers as early as possible while micro-partitions with few tuples are transmitted later; this lets computation overlap effectively with data transfer and helps reduce the storage space occupied by Buckets;
Accordingly, the still-unallocated micro-partitions are ordered by tuple factor, and the first several micro-partitions are selected and assigned to Reducers;
(4) Recording the local and global data distribution
(4.1) Recording the local data distribution
The micro-partition vector α records the tuple factors of all micro-partitions of a Mapper in the current iteration round, and thus describes that Mapper's data partitioning in the current iteration;
(4.2) Recording the global data distribution
The corresponding components of the α vectors on all Mappers are accumulated to obtain the global data distribution, represented by a two-dimensional vector AM:
AM = Σ αs (s ∈ [1, m])
At the end of each Mapper iteration, the micro-partition vector α is sent to the Master and added into AM, yielding the global data distribution up to the current iteration round;
Since Reducers and Buckets correspond one to one, the number of tuples allocated to each Reducer can be obtained by counting the tuples in each Bucket; a data allocation vector AB is defined:
AB = (B1, B2, B3, …, Bb)
where Br = (f1, f2, f3, …, fn) (r ∈ [1, b]); this vector records the distribution of the logical micro-partitions over the Buckets: each row represents one Bucket, and the elements of a row are the index values of the micro-partitions assigned to that Bucket;
The total number of tuples allocated to each Bucket is obtained by accumulating, over the indices in each row of AB, the corresponding tuple counts;
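A small worked example of AB, under the assumption of four micro-partitions and three Buckets (the numbers are illustrative): each row of AB holds the indices assigned to one Bucket, and summing the matching components of the global vector AM gives each Bucket's tuple total:

```python
AM = [13, 7, 7, 20]      # global tuple count of each logical micro-partition
AB = [[3], [0], [1, 2]]  # row r: indices of the micro-partitions in Bucket r

# Tuple total of each Bucket: accumulate AM over the indices in each row of AB
bucket_totals = [sum(AM[i] for i in row) for row in AB]
print(bucket_totals)  # [20, 13, 14]
```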
(5) Designing the data-balancing partitioning algorithm
The task of the data-balancing partitioning algorithm is, with the future data distribution unknown, to distribute the p logical micro-partitions selected in the current iteration over the b Buckets, based on the logical micro-partitions already assigned to each Bucket and their tuple counts, so that the data volumes of the Buckets after assignment are as balanced as possible;
The data distribution is realized by the following algorithm:
Step a1: Accumulate the logical micro-partition vectors α of the FG-Blocks to obtain AM;
Step a2: Sort the unassigned logical micro-partitions in AM in descending order;
Step a3: Sort the Buckets in ascending order of the data already allocated to them;
Step a4: Take the first p unassigned logical micro-partitions of AM and assign them to the first b Buckets;
Step a5: If there is a next FG-Block, terminate; otherwise, go to step a6;
Step a6: Sort the Buckets in ascending order of the data already allocated to them;
Step a7: Take the first p unassigned logical micro-partitions of AM and assign them to the first b Buckets;
Step a8: If unassigned logical micro-partitions remain in AM, go to step a6; otherwise, terminate;
By iteratively processing the multiple fine-grained data blocks (FG-Blocks) obtained from the subdivision according to the above procedure, the balance of the data assigned to the individual Reducers is optimized, data skew is reduced, and the data-processing performance of Spark is improved.
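Steps a1–a8 can be sketched as a greedy loop: the remaining micro-partitions are taken p at a time in descending tuple-count order, and each batch is placed on the currently least-loaded Buckets. This is a minimal single-block sketch, not the patent's reference implementation; it assumes AM has already been accumulated (step a1), assumes p ≤ b, and omits the per-FG-Block interleaving of step a5:

```python
def balance_partitions(AM, b, p):
    """Greedily assign micro-partitions to Buckets (steps a2-a4, a6-a8).

    AM: tuple count of each logical micro-partition (already accumulated);
    b: number of Buckets; p: micro-partitions taken per round (p <= b assumed);
    returns (AB, totals): per-Bucket index lists and per-Bucket tuple totals."""
    AB = [[] for _ in range(b)]
    totals = [0] * b
    # Step a2: unassigned micro-partitions in descending order of tuple count
    order = sorted(range(len(AM)), key=lambda i: AM[i], reverse=True)
    while order:  # step a8: repeat until no unassigned micro-partition remains
        batch, order = order[:p], order[p:]
        # Steps a3/a6: Buckets in ascending order of already-allocated data
        by_load = sorted(range(b), key=lambda r: totals[r])
        for part, bucket in zip(batch, by_load):  # steps a4/a7
            AB[bucket].append(part)
            totals[bucket] += AM[part]
    return AB, totals

AM = [90, 70, 50, 40, 30, 10]
AB, totals = balance_partitions(AM, b=3, p=3)
print(totals)  # [100, 100, 90]
```

Because each round pairs the heaviest remaining micro-partitions with the lightest Buckets, the Bucket totals stay close together even when the tuple counts are skewed.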
CN201710623289.6A 2017-07-27 2017-07-27 A kind of iterative data balancing optimization method towards Spark parallel computation frames Pending CN107506388A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710623289.6A CN107506388A (en) 2017-07-27 2017-07-27 A kind of iterative data balancing optimization method towards Spark parallel computation frames


Publications (1)

Publication Number Publication Date
CN107506388A true CN107506388A (en) 2017-12-22

Family

ID=60690119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710623289.6A Pending CN107506388A (en) 2017-07-27 2017-07-27 A kind of iterative data balancing optimization method towards Spark parallel computation frames

Country Status (1)

Country Link
CN (1) CN107506388A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598567A (en) * 2015-01-12 2015-05-06 北京中交兴路车联网科技有限公司 Data statistics and de-duplication method based on Hadoop MapReduce programming frame
US20150378696A1 (en) * 2014-06-27 2015-12-31 International Business Machines Corporation Hybrid parallelization strategies for machine learning programs on top of mapreduce
CN106126343A (en) * 2016-06-27 2016-11-16 西北工业大学 MapReduce data balancing method based on increment type partitioning strategies
CN106502790A (en) * 2016-10-12 2017-03-15 山东浪潮云服务信息科技有限公司 A kind of task distribution optimization method based on data distribution


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Bian Chen: "Partition mapping algorithm of in-memory computing framework based on iterative progressive filling", Journal of Computer Applications *
Wang Zhuo: "MapReduce data balancing method based on incremental partitioning strategy", Chinese Journal of Computers *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309177A (en) * 2018-03-23 2019-10-08 腾讯科技(深圳)有限公司 A kind of method and relevant apparatus of data processing
CN110309177B (en) * 2018-03-23 2023-11-03 腾讯科技(深圳)有限公司 Data processing method and related device
CN108572873A (en) * 2018-04-24 2018-09-25 中国科学院重庆绿色智能技术研究院 A kind of load-balancing method and device solving the problems, such as Spark data skews
CN108572873B (en) * 2018-04-24 2021-08-24 中国科学院重庆绿色智能技术研究院 Load balancing method and device for solving Spark data inclination problem
CN110264722A (en) * 2019-07-03 2019-09-20 泰华智慧产业集团股份有限公司 The screening technique and system of warping apparatus in information collecting device
CN112000467A (en) * 2020-07-24 2020-11-27 广东技术师范大学 Data tilt processing method and device, terminal equipment and storage medium

Similar Documents

Publication Publication Date Title
US10218808B2 (en) Scripting distributed, parallel programs
Gautam et al. A survey on job scheduling algorithms in big data processing
CN104281652B (en) Strong point data partition method one by one in metric space
CN107506388A (en) A kind of iterative data balancing optimization method towards Spark parallel computation frames
CN104820708B (en) A kind of big data clustering method and device based on cloud computing platform
US10901800B2 (en) Systems for parallel processing of datasets with dynamic skew compensation
CN104809244B (en) Data digging method and device under a kind of big data environment
Tang et al. An intermediate data partition algorithm for skew mitigation in spark computing environment
Zhang et al. Dart: A geographic information system on hadoop
Ibrahim et al. Improvement of job completion time in data-intensive cloud computing applications
Kumar et al. Big data streaming platforms: A review
CN109471877B (en) Incremental temporal frequent pattern parallel mining method facing streaming data
Wang et al. Fine-grained probability counting for cardinality estimation of data streams
Shin et al. Cocos: Fast and accurate distributed triangle counting in graph streams
Aslam et al. Pre‐filtering based summarization for data partitioning in distributed stream processing
Belcastro et al. Evaluation of large scale roi mining applications in edge computing environments
Yu et al. A MapReduce reinforced distributed sequential pattern mining algorithm
Hilbrich et al. Order preserving event aggregation in TBONs
Abubaker et al. Minimizing staleness and communication overhead in distributed SGD for collaborative filtering
Singh et al. Estimating quantiles from the union of historical and streaming data
Wu et al. Accelerating real-time tracking applications over big data stream with constrained space
Wang et al. Skew‐aware online aggregation over joins through guided sampling
Li et al. DSS: a scalable and efficient stratified sampling algorithm for large-scale datasets
Feng et al. The edge weight computation with mapreduce for extracting weighted graphs
Wu et al. Real-Time Search Method for Large-Scale Regional Targets Based on Parallel Google S2 Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171222