CN105701209A

CN105701209A - Load balancing method for improving parallel connection performance on big data

Info

Publication number: CN105701209A
Application number: CN201610019840.1A
Authority: CN
Inventors: 葛微; 李先贤; 王利娥
Original assignee: Guangxi Normal University
Current assignee: Guangxi Normal University
Priority date: 2016-01-13
Filing date: 2016-01-13
Publication date: 2016-06-22

Abstract

The invention discloses a load balancing method for improving parallel join performance on big data. The method comprises the following steps: 1) initializing the division of mass data into data blocks according to query results, wherein each data block contains the data of the multiple tables participating in the join whose join attribute satisfies the query condition, and the table data is organized and managed as data blocks; 2) recording the access ratio of each data block, accumulating the heat of the data block, and calculating its average access ratio; 3) adaptively adjusting the division of the data blocks according to their average access ratio, and triggering the merging and splitting of data blocks according to how well the division fits the query requests; 4) distributing hot data uniformly across the cluster nodes, and balancing the load of query tasks according to heat; and 5) executing the join query requests on each cluster node, collecting the join results, and returning them to the client. The method improves the time efficiency of data queries, balances the join query load, and improves the performance of join operations.

Description

A load balancing method for improving parallel join performance on big data

Technical field

The present invention relates to parallel load balancing techniques for big data, and specifically to a load balancing method for improving parallel join performance on big data.

Background art

A join operation in a relational database merges two tables horizontally, that is, it combines the rows of the two tables that match on a common data item. In relational algebra, a join consists of a Cartesian product followed by a selection. The Cartesian product first multiplies the two data sets, and the selection is then applied to the resulting set so that only rows that come from the two data sets and share matching values are combined. The purpose of a join is to merge two data sets (usually tables) horizontally and produce a new result set, by combining each row of one data source with its matching rows in the other data source into a new tuple.

Big data creates an urgent demand for join queries: static data such as user profiles is stored in data tables, while data such as clickstream logs and business logs is generated and accumulated continuously. Log analysis requires joining the static data with the dynamic data, and further in-depth analysis builds on the join results. However, the join is one of the most expensive query operations in data analysis; the most naive method computes the Cartesian product of all rows of the two tables. Optimization strategies for join operations have long been a research focus in relational databases, and join query optimization is even more urgent in today's big data environments. The excessive cost of joins over mass data causes data analysis results to lag significantly.

In a parallel computing environment, the most common way to optimize a join is parallelization: the join task is distributed to the nodes of the cluster, the join is executed in parallel, and the results from each node are then aggregated. Data partitioning and task load balancing are the key points and difficulties of such algorithms: how to distribute the data across the cluster nodes so that, for a join task, each node can execute the join locally on the data it holds, and so that the cluster's tasks are executed evenly across the nodes.

For the natural join (equi-join), the typical method is the hash join. An important property of the join operation is that tuples from the two participating data sources R and S produce a join result only if their join attribute values are equal. Therefore, if R and S are both partitioned on the join attribute with the same hash function, only data mapped to the same node can produce join results, and data mapped to different nodes cannot. On this premise, the hash join uses data partitioning to greatly reduce the cost of joining data in a distributed environment. However, the hash join does not consider load balancing, so the computing tasks of the nodes may be unbalanced. This greatly degrades parallel join performance, because the completion time of a join task in a distributed environment is determined by the slowest computing node.
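The partitioning idea behind the hash join, and the skew it can suffer from, can be illustrated with a minimal single-process sketch. The function names (`hash_partition`, `local_hash_join`, `parallel_hash_join`), the sample tables, and the use of Python's built-in `hash` as the shared hash function are assumptions made for illustration; they are not taken from the patent, and the loop over nodes only stands in for real parallel execution.

```python
from collections import defaultdict

def hash_partition(rows, key, num_nodes):
    """Send each row to the node chosen by hashing its join-attribute value."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % num_nodes].append(row)
    return parts

def local_hash_join(r_rows, s_rows, key):
    """Join two co-located fragments: build a hash table on R, probe it with S."""
    table = defaultdict(list)
    for r in r_rows:
        table[r[key]].append(r)
    return [{**r, **s} for s in s_rows for r in table.get(s[key], [])]

def parallel_hash_join(R, S, key, num_nodes):
    """Partition both inputs with the same hash function, then join node by node."""
    r_parts = hash_partition(R, key, num_nodes)
    s_parts = hash_partition(S, key, num_nodes)
    result, load = [], {}
    for node in range(num_nodes):  # each iteration stands in for one cluster node
        r_frag, s_frag = r_parts.get(node, []), s_parts.get(node, [])
        load[node] = len(r_frag) + len(s_frag)  # rows this node must process
        result.extend(local_hash_join(r_frag, s_frag, key))
    return result, load

# Hypothetical sample data: user 1 is "hot" and dominates the click log.
R = [{"uid": u, "name": f"user{u}"} for u in range(1, 5)]
S = [{"uid": 1, "page": p} for p in "abcdef"] + [{"uid": 3, "page": "g"}]
joined, load = parallel_hash_join(R, S, "uid", num_nodes=2)
```

With this sample data the join result is complete, but `load` shows one node processing far more rows than the other, which is the imbalance the paragraph above describes.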

In 2004, Google published the MapReduce paper, which proposed a parallel processing framework with unified scheduling. MapReduce can detect slow task nodes in the cluster; when a node finishes its own work, it is assigned to execute the slow node's task alongside that node. This scheduling strategy improves the performance of parallel computing tasks, but it is a remedial measure taken once the task manager detects a delay. In big data application scenarios, however, skewed data access is common: a small portion of the data is accessed very frequently, while most of the data is rarely accessed. The MapReduce parallel computing framework does not yet schedule tasks according to data access frequency (which reflects the task load of data queries), so task load balancing only begins as a remedy after skew has appeared in node execution.

Summary of the invention

The present invention addresses the deficiencies of the prior art by providing a load balancing method for improving parallel join performance on big data. The method reduces the overhead of data management and maintenance and improves the response time of data queries. It also captures the access distribution of the data accurately and measures the data access load according to that distribution, balancing both the join query load and the data distribution so that tasks are evenly loaded across the nodes, thereby improving the performance of join operations.

The technical solution of the present invention is as follows:

A load balancing method for improving parallel join performance on big data, comprising the following steps:

1) Initialize the division of mass data into data blocks according to query results: the data is divided by clustering the result of each query into a block, and the metadata that manages a data block records the block's value range on the join attribute, i.e. the start and end values of a continuous range; each data block contains the data of the multiple tables participating in the join whose join attribute satisfies the query condition; data is divided into a block, and organized as a data block, only when it is first hit by a query;

2) Record the access ratio of each data block, calculate its average access ratio, and accumulate the total heat and total data volume of each node: when a query hits the whole data block, the heat of the block is incremented by 1, i.e. 100%; when a query hits only part of the data block, the heat is incremented by a value between 0 and 1, namely the fraction of the data block that was accessed;

Access ratio = number of records accessed in the data block by this access / total number of records in the data block;

Heat of a data block = sum of the access ratios of all accesses to the data block;

Average access ratio = heat of the data block / number of times the data block has been accessed;

3) Adaptively adjust the division of the data blocks according to their average access ratio over successive queries, which measures how well the data division fits the query requests; the merging and splitting of data blocks are triggered according to this degree of fit;

4) Distribute the data uniformly across the nodes of the cluster so that the data on each node is roughly balanced in both space overhead and heat; the heat of data represents the query load that the data bears, and a data block with high heat is a frequently accessed data block; after the adaptive adjustment by splitting and merging, data heat measures accurately how hot or cold the data is, and on this basis the load balance of the cluster can be adjusted dynamically in real time, accurately and efficiently; the join computation is shared evenly by the nodes of the cluster and executed in parallel, which can bring the parallel performance of the cluster to its optimum;

5) Distribute the tasks evenly to the nodes of the cluster, execute the join query requests on each node, and finally collect the join results and return them to the client.

The method optimizes join query performance according to the access frequency and computation cost of the data. It records the access frequency of the data to capture and fit the access distribution, and divides the data into data blocks of different heat according to the data access pattern. Based on the space overhead and heat of the data, the data is distributed to different nodes so that the data access load is spread evenly across the cluster. Such an access-frequency-oriented load balancing strategy makes full use of the storage and computing resources of each cluster node.

The method divides data into blocks and organizes and manages the data blocks with a skip list, which reduces the overhead of data management and maintenance and improves the response time of data queries. At the same time, the adaptive adjustment of the data blocks captures the access distribution of the data accurately; the data access load is measured according to this distribution, the join query load and the data distribution are both balanced, and tasks are evenly loaded across the nodes, thereby improving the performance of join operations.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the method of the embodiment;

Fig. 2 is a schematic diagram of the data divided into continuous blocks in the embodiment;

Fig. 3 is a schematic diagram of the splitting of data block 37 in the embodiment;

Fig. 4 is a schematic diagram of the merging of data blocks 32 and 37 in the embodiment;

Fig. 5 is a schematic diagram of the data division fitting the data access pattern in the embodiment.

Detailed description of the embodiments

The content of the present invention is further described below with reference to the drawings and an embodiment, which do not limit the invention.

Embodiment:

Referring to Fig. 1, a load balancing method for improving parallel join performance on big data comprises the following steps:

1) Initialize the division of mass data into data blocks according to query results: the data is divided by clustering the result of each query into a block, and the metadata that manages a data block records the block's value range on the join attribute, i.e. the start and end values of a continuous range; each data block contains the data of the multiple tables participating in the join whose join attribute satisfies the query condition; data is divided into a block, and organized as a data block, only when it is first hit by a query;

Referring to Fig. 2, suppose the query attribute values of the data range from 7 to 99. In this embodiment the data is divided by clustering the result of each query into a block: the five ranges 7-13, 21-31, 32-36, 37-70 and 85-99 are each divided into a data block when first hit by a query. Each data block contains the data of the multiple tables participating in the join whose join attribute satisfies the query condition; for example, if the two tables participating in the join are table R and table S, the data of both table R and table S whose query attribute values lie in 7-13 is placed in the same data block. The metadata that manages a data block records the block's value range on the join attribute, i.e. the start and end values of a continuous range, such as 37-70. Data is divided into a block when first hit by a query and is thereafter managed as a block; data that has never been hit by a query, such as the two ranges 14-20 and 71-84, is not managed as data blocks;
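The query-driven block initialization of step 1 can be sketched as follows. The `Block` and `BlockIndex` names are hypothetical, a plain sorted list stands in for whatever index structure manages the blocks, and integer join-attribute values are assumed. How a query that only partly overlaps an existing block should be handled is not spelled out in the text; this sketch assumes that only the not-yet-covered part of the queried range becomes a new block.

```python
from dataclasses import dataclass

@dataclass
class Block:
    start: int            # first join-attribute value covered (inclusive)
    end: int              # last join-attribute value covered (inclusive)
    heat: float = 0.0     # accumulated access ratio
    accesses: int = 0     # number of queries that touched the block

class BlockIndex:
    def __init__(self):
        self.blocks = []  # kept sorted by start, never overlapping

    def uncovered(self, lo, hi):
        """Return the sub-ranges of [lo, hi] that no block covers yet."""
        gaps, cursor = [], lo
        for b in self.blocks:
            if b.end < lo or b.start > hi:
                continue
            if b.start > cursor:
                gaps.append((cursor, b.start - 1))
            cursor = max(cursor, b.end + 1)
        if cursor <= hi:
            gaps.append((cursor, hi))
        return gaps

    def query(self, lo, hi):
        """Create blocks for data hit for the first time; return all blocks hit."""
        for start, end in self.uncovered(lo, hi):
            self.blocks.append(Block(start, end))
        self.blocks.sort(key=lambda b: b.start)
        return [b for b in self.blocks if b.end >= lo and b.start <= hi]

# Replay the five first-time queries of the embodiment (order is arbitrary).
index = BlockIndex()
for lo, hi in [(37, 70), (7, 13), (85, 99), (21, 31), (32, 36)]:
    index.query(lo, hi)
ranges = [(b.start, b.end) for b in index.blocks]
unmanaged = index.uncovered(7, 99)
```

After these queries, `ranges` holds the five blocks of the example and `unmanaged` holds the two ranges that were never queried and therefore have no block.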

2) Record the access ratio of each data block, calculate its average access ratio, and accumulate the total heat and total data volume of each node: when a query hits the whole data block, the heat of the block is incremented by 1, i.e. 100%; when a query hits only part of the data block, the heat is incremented by a value between 0 and 1, namely the fraction of the data block that was accessed;

Heat of a data block = sum of the access ratios of all accesses to the data block;

Average access ratio = heat of the data block / number of times the data block has been accessed;
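The bookkeeping of step 2 can be written as a small sketch. It assumes, as the wording of the step suggests, that the heat of a block is the running sum of its per-access ratios and that the average access ratio is that heat divided by the number of accesses. The `BlockStats` class and the sample record counts are hypothetical.

```python
class BlockStats:
    """Per-block access statistics (hypothetical helper, not from the patent)."""

    def __init__(self, total_records):
        self.total_records = total_records  # records stored in the block
        self.heat = 0.0                     # sum of access ratios so far
        self.accesses = 0                   # number of queries that hit the block

    def record_access(self, records_hit):
        """Add one access; a full hit adds 1.0, a partial hit adds a fraction."""
        ratio = records_hit / self.total_records
        self.heat += ratio
        self.accesses += 1
        return ratio

    @property
    def average_access_ratio(self):
        return self.heat / self.accesses if self.accesses else 0.0

# A block of 100 records hit by three queries: fully, by half, by a quarter.
stats = BlockStats(total_records=100)
for records_hit in (100, 50, 25):
    stats.record_access(records_hit)
```

A block whose average access ratio stays close to 1 is usually hit as a whole, while a low average indicates that queries tend to touch only small parts of it.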

3) Adaptively adjust the division of the data blocks according to their average access ratio: every data access records the access ratio of each data block it touches, which measures how well the data division fits the query requests, and the merging and splitting of data blocks are triggered according to this degree of fit; merging and splitting carry out the adaptive adjustment of the data division and yield a division of data blocks that matches the data access distribution;

The adaptive adjustment of data blocks works as follows. If the average access ratio of a data block is very low, as for data block 37 (the block covering 37-70) in Fig. 2, the data records in the block are only weakly correlated with respect to the query request sequence: queries usually hit only a small part of the block, so the records need not be clustered together, and a split of the data block is triggered. Referring to Fig. 3, data block 37 is split into three data blocks of nearly equal length, 37-47, 48-58 and 59-70, each of which inherits the heat and access count of the original data block 37;

Conversely, if two consecutive data blocks have very similar heat and access counts, they are usually hit together, i.e. the data blocks are strongly correlated, and a merge of the data blocks is triggered. For example, after 2000 query accesses, the access counts and heat of data blocks 32 and 37 in Fig. 3 are very close, which triggers the merging of the two data blocks; see Fig. 4;

Splitting and merging are the means of adaptively adjusting the data blocks. Through this adjustment, the division of the data blocks gradually fits the distribution of the query request sequence; see Fig. 5. Organizing and managing the data as data blocks significantly reduces the time and space cost of management;
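A minimal sketch of the split and merge operations follows. Splitting into three parts that inherit the parent's heat and access count follows the example above. Everything else is an assumption made for illustration: the `Block` type, the relative tolerance `tol` used to decide that two blocks are "very close", and the statistics given to a merged block (mean heat, larger access count), none of which the text specifies. All heat and access values in the demo are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Block:
    start: int       # first join-attribute value (inclusive)
    end: int         # last join-attribute value (inclusive)
    heat: float
    accesses: int

def split(block, parts=3):
    """Split a block into `parts` nearly equal ranges that inherit its statistics."""
    length = block.end - block.start + 1
    if length < parts:
        return [block]                      # too small to split further
    size, pieces, lo = length // parts, [], block.start
    for i in range(parts):
        hi = block.end if i == parts - 1 else lo + size - 1
        pieces.append(Block(lo, hi, block.heat, block.accesses))
        lo = hi + 1
    return pieces

def close(a, b, tol):
    """True if a and b differ by at most `tol` relative to the larger value."""
    return abs(a - b) <= tol * max(a, b, 1e-12)

def can_merge(a, b, tol=0.1):
    """Adjacent blocks with similar heat and access counts may be merged."""
    return (a.end + 1 == b.start
            and close(a.heat, b.heat, tol)
            and close(a.accesses, b.accesses, tol))

def merge(a, b):
    # Assumed statistics for the merged block: mean heat, larger access count.
    return Block(a.start, b.end, (a.heat + b.heat) / 2, max(a.accesses, b.accesses))

# Block 37-70 is rarely hit as a whole (average ratio 3.0 / 20 = 0.15): split it.
pieces = split(Block(37, 70, heat=3.0, accesses=20))
# Later queries (hypothetical numbers) make 32-36 and 37-47 look alike: merge them.
b32 = Block(32, 36, heat=10.5, accesses=30)
b37 = pieces[0]
b37.heat, b37.accesses = 10.0, 28
merged = merge(b32, b37) if can_merge(b32, b37) else None
```

Split and merge are kept as separate operations on purpose: freshly split pieces carry identical inherited statistics, so a merge check applied immediately after a split would simply undo it. Merging only makes sense after further accesses have updated the statistics.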

4) Distribute the data uniformly across the nodes of the cluster so that the total heat and total storage space of the data on each node remain roughly balanced: after an initial phase of splitting and merging, the division generally stabilizes after 2000-5000 accesses; by then the data division has been trained to fit the data access pattern, splits and merges become much less frequent, and the data can be distributed to the nodes of the cluster. Referring to Fig. 4, data blocks 7, 21, 32, 48, 59 and 85 are hot data blocks, data blocks 14 and 71 are cold data blocks, and their heat values are:

Data block 7 heat = 27.5,

Data block 14 heat = 0.8,

Data block 21 heat = 10.3,

Data block 32 heat = 10.875,

Data block 48 heat = 3.5,

Data block 59 heat = 3.5,

Data block 71 heat = 0.3,

Data block 85 heat = 25.8,

The data blocks are sorted by heat and distributed to the nodes of the cluster as follows:

Node 1 holds data blocks 7 and 71, with a total heat of 27.8,

Node 2 holds data blocks 85, 14 and 48, with a total heat of 30.1,

Node 3 holds data blocks 21, 32 and 59, with a total heat of 24.675;
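One simple way to sort blocks by heat and spread them over nodes is the greedy heuristic sketched below: blocks are taken from hottest to coldest and each is placed on the node with the lowest total heat so far. This is a swapped-in textbook technique, not the procedure of the embodiment; it ignores storage space, which the method also balances, and it need not reproduce the placement listed above. The function name `assign_by_heat` and the zero-based node numbering are assumptions.

```python
import heapq

def assign_by_heat(block_heat, num_nodes):
    """Greedy placement: hottest block first, always onto the coolest node."""
    heap = [(0.0, node) for node in range(num_nodes)]  # (total heat, node id)
    heapq.heapify(heap)
    placement = {node: [] for node in range(num_nodes)}
    for block, heat in sorted(block_heat.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(heap)               # coolest node so far
        placement[node].append(block)
        heapq.heappush(heap, (load + heat, node))
    loads = {node: load for load, node in heap}
    return placement, loads

# Heat values of the eight data blocks of the embodiment, keyed by block id.
block_heat = {7: 27.5, 14: 0.8, 21: 10.3, 32: 10.875,
              48: 3.5, 59: 3.5, 71: 0.3, 85: 25.8}
placement, loads = assign_by_heat(block_heat, num_nodes=3)
```

On these heat values the heuristic yields per-node totals of roughly 27.5, 26.9 and 28.2. A fuller version would add each block's storage size as a second balancing criterion.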

Storage and management of big data must focus on hot data: because hot data is accessed frequently, the cost of accessing it largely determines system performance. The join operations on hot data are shared evenly by the nodes, and the system achieves its best performance when the load is balanced;

5) The tasks are shared evenly by the nodes of the cluster, the join query requests are executed on each node, and finally the join results are collected and returned to the client.

Claims

1. A load balancing method for improving parallel join performance on big data, comprising the following steps:

Initializing the division of mass data into data blocks according to query results: the data is divided by clustering the result of each query into a block, and the metadata that manages a data block records the block's value range on the join attribute, i.e. the start and end values of a continuous range; each data block contains the data of the multiple tables participating in the join whose join attribute satisfies the query condition; data is divided into a block, and organized as a data block, only when it is first hit by a query;

Recording the access ratio of each data block, accumulating the heat of the data block, and calculating its average access ratio: when a query hits the whole data block, the heat of the block is incremented by 1; when a query hits only part of the data block, the heat is incremented by a value between 0 and 1, namely the fraction of the data block that was accessed;

Heat of a data block = sum of the access ratios of all accesses to the data block;

Average access ratio = heat of the data block / number of times the data block has been accessed;

Adaptively adjusting the division of the data blocks according to their average access ratio: every data access records the access ratio of each data block it touches, which measures how well the data division fits the query requests, and the merging and splitting of data blocks are triggered according to this degree of fit;

Distributing the data uniformly across the nodes of the cluster so that the total heat and total storage space of the data on each node remain roughly balanced, and balancing the load of query tasks according to heat;

Distributing the tasks evenly to the nodes of the cluster, executing the join query requests on each node, and finally collecting the join results and returning them to the client.