CN115858523A

CN115858523A - Hash Join execution method for detecting skewed data

Info

Publication number: CN115858523A
Application number: CN202211418045.1A
Authority: CN
Inventors: 陈磊; 魏可伟; 赵衎衎
Original assignee: Shanghai Yunxi Technology Co ltd
Current assignee: Shanghai Yunxi Technology Co ltd
Priority date: 2022-11-14
Filing date: 2022-11-14
Publication date: 2023-03-28

Abstract

The invention discloses a Hash Join execution method for detecting skewed data, which relates to the technical field of distributed databases and comprises the following steps: acquiring two input data sets; setting a relative skew rate and multiplying it by the data volume of one of the input data sets to obtain a skew threshold, wherein a value whose occurrence frequency in the input data set exceeds the skew threshold is called a skew value; checking whether a skew value exists in the input data set with the larger total data volume; if not, performing hash distribution on each of the two input data sets; if so, detecting all skew values of that input data set in a certain field to obtain a skew value list, then splitting each input data set, based on the skew value list, into a Normal data set without skew values and a Skew data set with skew values, performing hash distribution on the Normal data sets, and performing average distribution or mirror distribution on the Skew data sets; finally, the nodes in the cluster perform Hash Join calculations on the data they hold. The invention can improve the execution efficiency of Hash Join.

Description

Hash Join execution method for detecting skewed data

Technical Field

The invention relates to the technical field of distributed databases, and in particular to a Hash Join execution method for detecting skewed data.

Background

In the field of databases, particularly databases whose main function is querying, response speed is an important factor in user experience. The rapid development of the Internet generates large amounts of data, and to cope with such volumes, distributed technology has been widely adopted in the database field.

The SQL engine of a database contains a Join operator, whose role is to perform Join computations, i.e., to find the records of two sets that are equal in certain fields. The operator generally has two input data sets. Each input data set comprises a plurality of records, each record comprises a plurality of fields, and all records in a data set have the same format. After computation, the Join operator outputs a data set comprising a plurality of records, each formed by combining one record from each of the two input data sets, the combination condition being that the two records have the same value in one or more fields.

There are generally three methods for implementing the Join operator, namely Hash Join, Merge Join, and Nested Loop Join; the description here focuses on Hash Join. The naive Hash Join method is divided into two phases: (1) a build phase and (2) a probe phase. Each phase uses one of the input data sets. In the build phase, each record of one input data set is inserted into a hash table, where the key is the value of the field to be joined and the value is the record. In the probe phase, for each record of the other input data set, the hash table is queried for the value of the corresponding field; if it exists, the matching record in the hash table and the probing record are combined to form a result record.
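The two phases described above (table building, i.e. build, and detection, i.e. probe) can be sketched on a single node as follows. This is a minimal illustration, not the patented implementation; the function name `hash_join`, the dictionary-based record format, and the sample data are assumptions made for the example.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, key):
    """Naive single-node hash join on one field (illustrative sketch)."""
    # Build phase: index every build-side record by its join-key value.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    # Probe phase: look up each probe-side record and emit merged records.
    result = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            result.append({**match, **row})
    return result

# Hypothetical sample data: records are dicts, joined on the "cust" field.
orders = [{"cust": 1, "item": "pen"}, {"cust": 2, "item": "ink"}, {"cust": 1, "item": "pad"}]
customers = [{"cust": 1, "name": "Ann"}, {"cust": 3, "name": "Bob"}]
joined = hash_join(customers, orders, "cust")
```

Here only the two orders of customer 1 find a match in the hash table, so `joined` holds two combined records.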

In a distributed environment, a plurality of computing nodes are usually available, and it is clearly inappropriate to use only one node to perform Hash Join: the response speed is poor and computing resources are wasted. Most current distributed databases distribute all records of the two input data sets onto multiple computing nodes by applying the same hash function to the specified fields, and then perform a naive Hash Join on each node. This method is logically simple, has certain deployment advantages in large-scale database engineering projects, and clearly improves computing performance compared with a single-node Join, but the performance gain drops sharply when the input data is skewed. When an input data set is skewed, a large number of records share the same value in a certain field; the same hash function yields the same result for these values, so the records are all distributed to the same computing node. That node ends up with a large amount of input data while the other nodes have little, which prolongs the execution time of the whole Hash Join. Experts in the field have studied solutions to this problem, the more mature of which are schemes based on statistical information. For example, "Xu Y, Kostamaa P, Zhou X, et al. Handling data skew in parallel joins in shared-nothing systems[C]// Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. 2008" proposes the PRPD algorithm, which specially processes the skewed data of the two input data sets; however, its main aim is to reduce the amount of network transmission, and it assumes that the skewed data is known in advance.
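The load imbalance described above can be reproduced with a small sketch. It assumes three nodes and uses Python's built-in `hash` modulo the node count as a stand-in for the database's hash function (for small non-negative integers in CPython, `hash` returns the integer itself); the function name `hash_partition` and the data are hypothetical.

```python
from collections import Counter

def hash_partition(keys, k):
    """Count how many records land on each of k nodes under hash distribution."""
    load = Counter({node: 0 for node in range(k)})
    for key in keys:
        load[hash(key) % k] += 1
    return load

uniform = list(range(9000))              # every join key appears once
skewed = [7] * 6000 + list(range(3000))  # key 7 makes up two thirds of the input

uniform_load = hash_partition(uniform, 3)
skewed_load = hash_partition(skewed, 3)
```

With the uniform input each node receives 3000 records, whereas with the skewed input the node that owns key 7 receives 7000 of the 9000 records, so that node determines the running time of the whole join.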

In view of this, a Hash Join execution method for detecting skewed data is provided for the general data scenario, namely one in which it is unknown whether the data is skewed and to what degree.

Disclosure of Invention

The invention provides a Hash Join execution method for detecting skewed data, which aims to solve the problem that a distributed Hash Join operator performs poorly on skewed data.

To solve the above technical problem, the Hash Join execution method for detecting skewed data of the invention adopts the following technical scheme:

a Hash Join execution method for detecting skewed data, the implementation comprising:

before executing Hash Join by using an SQL engine of a database, acquiring two input data sets of the Join operator;

setting a relative skew rate, and calculating the product of the relative skew rate and the data volume of one of the input data sets to obtain a skew threshold, wherein a value whose occurrence frequency in the two input data sets exceeds the skew threshold is called a skew value;

for the two input data sets, checking whether a skew value exists in the input data set having the larger total data volume;

if not, performing hash distribution on each of the two input data sets, so that the two input data sets are split and distributed to the Join node cluster;

if so, detecting all skew values of the input data set with the larger total data volume in a certain field to obtain a skew value list; splitting each input data set, based on the skew value list, into a Normal data set without skew values and a Skew data set with skew values; performing hash distribution on the Normal data sets, so that they are split and distributed to the Join node cluster; and performing average distribution or mirror distribution on the Skew data sets, so that they are either split and evenly distributed, or copied, to all nodes of the Join node cluster;

the nodes in the Join node cluster perform Hash Join calculation on the data they hold.

Specifically, of the two input data sets, the one with the larger total data volume is referred to as the Big data set, and the one with the smaller total data volume is referred to as the Small data set.

More specifically, the product of the relative skew rate and the amount of sampled data in the Big data set is calculated to obtain the skew threshold, and a value whose occurrence frequency in the Big data set and the Small data set exceeds the skew threshold is called a skew value;

a checker is used to check whether the Big data set has a skew value in a certain field, and if so, a detector is used to count all skew values of the Big data set in that field to obtain the skew value list.

Preferably, the checker first sequentially samples the Big data set, multiplies the amount of sampled data by the relative skew rate to obtain the skew threshold, and then checks whether a skew value exists in the sampled data.
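A minimal sketch of the checker and the detector follows. It assumes that "sequential sampling" means reading the first `sample_size` records, and that the detector applies the threshold to the full Big data set; the function names, the `sample_size` parameter, and the data are hypothetical and not taken from the source.

```python
from collections import Counter
from itertools import islice

def skew_threshold(relative_skew_rate, row_count):
    # Threshold = relative skew rate x number of rows considered.
    return relative_skew_rate * row_count

def check_skew(big_rows, key, relative_skew_rate, sample_size):
    """Checker: sequentially sample the first rows and test for any skew value."""
    sample = [row[key] for row in islice(big_rows, sample_size)]
    threshold = skew_threshold(relative_skew_rate, len(sample))
    return any(count > threshold for count in Counter(sample).values())

def detect_skew_values(big_rows, key, relative_skew_rate):
    """Detector: list every value whose frequency exceeds the threshold."""
    counts = Counter(row[key] for row in big_rows)
    threshold = skew_threshold(relative_skew_rate, len(big_rows))
    return sorted(v for v, c in counts.items() if c > threshold)

# Hypothetical Big data set: value 7 dominates the join field "k".
big = [{"k": 7}] * 60 + [{"k": i} for i in range(40)]
has_skew = check_skew(big, "k", relative_skew_rate=0.2, sample_size=50)
skew_values = detect_skew_values(big, "k", relative_skew_rate=0.2)
```

The checker is cheap because it only looks at a sample; the full count performed by the detector is needed only when the checker reports skew.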

More specifically, according to the skew value list, all records with skew values in the Big data set are split into a Skew 1 data set and the remaining data in the Big data set into a Normal 1 data set; at the same time, all records with skew values in the Small data set are split into a Skew 2 data set and the remaining data in the Small data set into a Normal 2 data set;
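The split into Skew and Normal data sets can be sketched as a single pass over each input. The helper name `split_by_skew` and the sample records are assumptions for illustration.

```python
def split_by_skew(rows, key, skew_values):
    """Split rows into (skew_rows, normal_rows) using the skew value list."""
    skew_set = set(skew_values)
    skew_rows, normal_rows = [], []
    for row in rows:
        (skew_rows if row[key] in skew_set else normal_rows).append(row)
    return skew_rows, normal_rows

# Hypothetical inputs; 7 is assumed to be the only entry in the skew value list.
big = [{"k": 7, "b": i} for i in range(5)] + [{"k": 1, "b": 5}, {"k": 2, "b": 6}]
small = [{"k": 7, "s": "x"}, {"k": 1, "s": "y"}]
skew1, normal1 = split_by_skew(big, "k", [7])
skew2, normal2 = split_by_skew(small, "k", [7])
```

Note that the same skew value list, derived from the Big data set, is applied to both inputs, so matching records always end up in corresponding data sets (Skew 1 with Skew 2, Normal 1 with Normal 2).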

a hash router is used to perform hash distribution on the Normal 1 data set and the Normal 2 data set respectively, so that both are split and distributed to the Join node cluster; an average router is used to perform average distribution on the Skew 1 data set, so that it is split and evenly distributed to all nodes of the Join node cluster; and a mirror router is used to perform mirror distribution on the Skew 2 data set, so that it is copied into multiple copies, one of which is distributed to each node of the Join node cluster.

Preferably, when the hash router is used to perform hash distribution on the Normal 1 data set and the Normal 2 data set, the same hash function is applied to certain fields of both data sets, each data set is split into a plurality of subsets according to the calculation result, and the subsets of the Normal 1 data set and the subsets of the Normal 2 data set are distributed to the corresponding nodes of the Join node cluster.

Preferably, when the average router is used to evenly distribute the Skew 1 data set, the Skew 1 data set is split into a plurality of subsets according to a certain field such that the records having the same value in that field are spread evenly over the subsets, and the subsets are then evenly distributed to all nodes of the Join node cluster.
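The three routers can be sketched as follows. The source does not specify how the average router spreads records, so round-robin dealing is used here as one plausible assumption; the function names and the use of Python's built-in `hash` are likewise illustrative only.

```python
def hash_route(rows, key, k):
    """Hash router: rows with equal key values go to the same node."""
    nodes = [[] for _ in range(k)]
    for row in rows:
        nodes[hash(row[key]) % k].append(row)
    return nodes

def average_route(rows, k):
    """Average router (assumed round-robin): deal rows out, ignoring the key."""
    nodes = [[] for _ in range(k)]
    for i, row in enumerate(rows):
        nodes[i % k].append(row)
    return nodes

def mirror_route(rows, k):
    """Mirror router: every node receives a full copy of the rows."""
    return [list(rows) for _ in range(k)]

# Hypothetical Skew 1 / Skew 2 data sets that share the skew value 7.
skew1 = [{"k": 7, "b": i} for i in range(9)]
skew2 = [{"k": 7, "s": "x"}]
hashed = hash_route(skew1, "k", 3)
averaged = average_route(skew1, 3)
mirrored = mirror_route(skew2, 3)
```

Hash routing would send all nine Skew 1 records to one node, while average routing places three on each node. Because the matching Skew 2 record is mirrored to every node, each Skew 1 record still finds its join partner locally.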

Preferably, the relative skew rate is a value less than 1; the smaller the relative skew rate, the more skew values there are in the input data set, and conversely, the larger the relative skew rate, the fewer skew values there are.
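The effect of the relative skew rate can be seen on a small, made-up frequency distribution. The helper name `skew_values` and the data are assumptions for illustration.

```python
from collections import Counter

def skew_values(values, relative_skew_rate):
    """Return the values whose frequency exceeds rate x total count."""
    threshold = relative_skew_rate * len(values)
    return sorted(v for v, c in Counter(values).items() if c > threshold)

# 100 hypothetical field values with frequencies 50, 30, 15, and 5.
values = ["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 5
low = skew_values(values, 0.1)    # threshold of 10 occurrences
high = skew_values(values, 0.4)   # threshold of 40 occurrences
```

With a rate of 0.1 the values "a", "b", and "c" all count as skew values, while with a rate of 0.4 only "a" does.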

Compared with the prior art, the Hash Join execution method for detecting skewed data has the following beneficial effects:

the invention detects data skew and, after splitting the input data sets containing skewed data, uses hash distribution, mirror distribution, and average distribution to distribute the data to the computing nodes of the Join node cluster, thereby improving the execution efficiency of Hash Join in the distributed database through balanced distribution of tasks.

Drawings

FIG. 1 is a flowchart illustrating steps S1-S3 according to a first embodiment of the present invention;

FIG. 2 is a flowchart of the step S4 according to the first embodiment of the present invention;

FIG. 3 is a flowchart of the step S5 according to the first embodiment of the present invention;

FIG. 4 is a flowchart of steps S6 to S7 according to the first embodiment of the present invention.

Detailed Description

In order to make the technical solutions, technical problems to be solved, and technical effects of the present invention more clearly apparent, the following description clearly describes the technical solutions of the present invention in combination with specific embodiments.

The first embodiment is as follows:

With reference to FIGS. 1-4, this embodiment provides a Hash Join execution method for detecting skewed data, which includes:

Step S1: before executing Hash Join with the SQL engine of a database, acquire the two input data sets of the Join operator; the input data set with the larger total data volume is called the Big data set, and the one with the smaller total data volume is called the Small data set.

Step S2: set a relative skew rate; based on the total data volume of the Big data set, the product of the relative skew rate and the total data volume is the skew threshold, and a value whose occurrence frequency in the Big data set and the Small data set exceeds the skew threshold is called a skew value.

Step S3: use a checker to sequentially sample the Big data set, multiply the amount of sampled data by the relative skew rate to obtain the skew threshold, and check whether a skew value exists in the sampled data; if not, execute step S4; if so, execute step S5.

Step S4: use a hash router to perform hash distribution on the Big data set and the Small data set respectively, so that both are split and distributed to the Join node cluster, then jump to step S8.

Step S5: use a detector to detect all skew values of the Big data set in a certain field to obtain a skew value list, then proceed to step S6.

Step S6: according to the skew value list, split all records with skew values in the Big data set into a Skew 1 data set and the remaining data in the Big data set into a Normal 1 data set, and split all records with skew values in the Small data set into a Skew 2 data set and the remaining data in the Small data set into a Normal 2 data set.

Step S7: use a hash router to perform hash distribution on the Normal 1 data set and the Normal 2 data set respectively, so that both are split and distributed to the Join node cluster;

use an average router to perform average distribution on the Skew 1 data set, so that it is split and evenly distributed to all nodes of the Join node cluster;

use a mirror router to perform mirror distribution on the Skew 2 data set, so that it is copied into multiple copies, one of which is distributed to each node of the Join node cluster.

Step S8: the nodes in the Join node cluster perform Hash Join calculation on the data they hold.
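The flow from skew detection through per-node joining can be sketched end to end in one process, with lists standing in for cluster nodes. This is a simplified illustration under several assumptions: the sampling checker of step S3 is omitted (an empty skew set simply reduces to the plain hash distribution of step S4), the average router is assumed to be round-robin, and all names and data are hypothetical.

```python
from collections import Counter, defaultdict

def local_join(big_part, small_part, key):
    # Per-node naive hash join: build on the Small part, probe with the Big part.
    table = defaultdict(list)
    for row in small_part:
        table[row[key]].append(row)
    return [{**s, **b} for b in big_part for s in table.get(b[key], [])]

def detect_join(big, small, key, k, relative_skew_rate):
    # S5: list the skew values of the Big data set on the join field.
    counts = Counter(row[key] for row in big)
    skew = {v for v, c in counts.items() if c > relative_skew_rate * len(big)}
    big_nodes = [[] for _ in range(k)]
    small_nodes = [[] for _ in range(k)]
    dealt = 0
    for row in big:
        if row[key] in skew:          # Skew 1: average (round-robin) distribution
            big_nodes[dealt % k].append(row)
            dealt += 1
        else:                         # Normal 1: hash distribution
            big_nodes[hash(row[key]) % k].append(row)
    for row in small:
        if row[key] in skew:          # Skew 2: mirror distribution to every node
            for node in small_nodes:
                node.append(row)
        else:                         # Normal 2: same hash function as Normal 1
            small_nodes[hash(row[key]) % k].append(row)
    # S8: each node joins only the data it holds.
    loads = [len(part) for part in big_nodes]
    rows = [r for i in range(k) for r in local_join(big_nodes[i], small_nodes[i], key)]
    return rows, loads

# Hypothetical data: key 7 accounts for 610 of the 900 Big records.
big = [{"k": 7, "b": i} for i in range(600)] + [{"k": i % 30, "b": 600 + i} for i in range(300)]
small = [{"k": v, "s": f"s{v}"} for v in range(30)]
rows, loads = detect_join(big, small, "k", k=3, relative_skew_rate=0.2)
expected = local_join(big, small, "k")
```

The distributed result contains the same records as a single-node join of the full inputs, while the Big-data-set load per node stays close to 300. Under plain hash distribution with the same hash function, the node owning key 7 would hold 700 of the 900 Big records.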

It should be noted that the relative skew rate is a value less than 1; the smaller the relative skew rate, the more skew values there are in the input data set, and conversely, the larger the relative skew rate, the fewer skew values there are.

For convenience, the flow from step S5 to step S7 in this embodiment is referred to as the Detect Join process.

The distributed Hash Join calculation method widely used by current distributed databases is compared theoretically with the method of this embodiment as follows. Assume that the data size of the Big data set is m and the data size of the Small data set is n; the time complexity of a standalone Hash Join is then O(m + n). For the distributed Hash Join operator, assume that there are k computing nodes, that the sizes of the Big data set on the k nodes are m_1, m_2, ..., m_k, and that the sizes of the Small data set are n_1, n_2, ..., n_k; the time complexity is then O(max(m_i) + max(n_i)). For Detect Join, because the skew values in the Big data set are evenly distributed over the k computing nodes, the whole Big data set is evenly distributed over the k computing nodes, so the time complexity is O(m/k + max(n_i)).

Obviously, m/k ≤ max(m_i).

Therefore, the Detect Join of this embodiment performs better than the distributed Hash Join operator, and the larger max(m_i) is, the greater the performance improvement.
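The comparison rests on the fact that a maximum is never smaller than a mean. A short derivation, assuming (as in the definitions above) that m_i is the number of Big-data-set records on node i, so that the m_i sum to m:

```latex
\max_{1 \le i \le k} m_i \;\ge\; \frac{1}{k}\sum_{i=1}^{k} m_i \;=\; \frac{m}{k}
```

Equality holds only when every node already holds m/k records. As an illustrative example with made-up numbers, for m = 9000 and k = 3, a skewed split of (7000, 1000, 1000) gives max(m_i) = 7000 against m/k = 3000.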

After the theoretical analysis, the distributed Hash Join calculation method widely adopted by current distributed databases and the method of this embodiment are verified experimentally, taking a Join node cluster of three computing nodes as an example.

The skew degree is denoted by the letter p, where skew degree = total number of records with skew values in the data set / amount of data in the data set; p1 denotes the skew degree of the Small data set and p2 denotes the skew degree of the Big data set. The values in the tables below are in ms.
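The definition of p can be expressed directly in code. This sketch assumes the skew value list is already known; the function name `skew_degree` and the data are hypothetical.

```python
def skew_degree(values, skew_values):
    """p = number of records carrying a skew value / total number of records."""
    skew_set = set(skew_values)
    return sum(1 for v in values if v in skew_set) / len(values)

# Hypothetical field values: 60 of 100 records carry the skew value 7.
values = [7] * 60 + list(range(100, 140))
p = skew_degree(values, [7])
```

For this input p is 0.6; p = 0 corresponds to a data set with no skew values, and p = 1 to a data set that consists entirely of records with skew values.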

(I) Assume that the total data volume of the Big data set is three times that of the Small data set

When p1 = 0 and p1 < p2, the experimental results are shown in Table 1:

Method	p2 = 0	p2 = 0.2	p2 = 0.4	p2 = 0.6	p2 = 0.8	p2 = 1
Current Hash Join	4061	4022	3956	4098	4252	4347
Detect Join	4144	3780	3734	3457	3446	3434

When p1= p2= p, the experimental results are shown in table 2:

Method	p = 0	p = 0.2	p = 0.4	p = 0.6	p = 0.8	p = 1
Current Hash Join	4061	3788	3764	3772	3673	3599
Detect Join	4144	3523	3500	3553	3627	3641

When p2 = 0 and p1 > p2, the experimental results are shown in Table 3:

Method	p1 = 0	p1 = 0.2	p1 = 0.4	p1 = 0.6	p1 = 0.8	p1 = 1
Current Hash Join	4061	3772	3718	3583	3546	3393
Detect Join	4144	3865	3818	3473	3325	3192

As can be seen from the data in tables 1 to 3, as long as p2 is greater than 0, the performance of the method of the embodiment is better than that of the distributed Hash Join widely adopted in the current distributed database.

(II) Assume that the total data volume of the Small data set is 0.8 times that of the Big data set

When p1 = 0 and p1 < p2, the experimental results are shown in Table 4:

Method	p2 = 0	p2 = 0.2	p2 = 0.4	p2 = 0.6	p2 = 0.8	p2 = 1
Current Hash Join	7112	7226	7887	7978	8233	8050
Detect Join	7527	7782	7347	6279	6164	6212

When p1= p2= p, the experimental results are shown in table 5:

Method	p = 0	p = 0.2	p = 0.4	p = 0.6	p = 0.8	p = 1
Current Hash Join	7112	7750	7064	6660	5524	4682
Detect Join	7527	6971	6434	5752	4806	4756

When p2 = 0 and p1 > p2, the experimental results are shown in Table 6:

Method	p1 = 0	p1 = 0.2	p1 = 0.4	p1 = 0.6	p1 = 0.8	p1 = 1
Current Hash Join	7112	7349	7078	6273	5615	5157
Detect Join	7527	7362	6361	5927	5346	5325

As can be seen from the data in Tables 4 to 6, as long as p2 > 0, the Detect Join of this embodiment performs better than the distributed Hash Join widely adopted by current distributed databases.

Taken together, the above experimental data show that, in the general data scenario where it is unknown whether the data is skewed and to what degree, the method of this embodiment performs skew detection on the input data and then executes the subsequent steps S4 to S8 according to whether the input data set contains skewed data, thereby improving the execution efficiency of Hash Join in the distributed database.

In summary, the Hash Join execution method for detecting skewed data according to the present invention can perform skew detection on data, and can also adopt different distribution means for an input data set containing skewed data to distribute data to computing nodes of a Join node cluster, thereby improving the execution efficiency of Hash Join in a distributed database.

The principles and embodiments of the present invention have been described in detail using specific examples, which are provided only to aid in understanding the core technical content of the present invention. Any improvements and modifications made by those skilled in the art on the basis of the above embodiments, without departing from the principle of the present invention, shall fall within the protection scope of the present invention.

Claims

1. A Hash Join execution method for detecting skewed data, the method comprising:

for two input data sets, checking whether a skew value exists in the input data set having the larger total data volume,

2. The Hash Join execution method for detecting skewed data according to claim 1, wherein, of the two input data sets, the one with the larger total data volume is called the Big data set and the one with the smaller total data volume is called the Small data set.

3. The Hash Join execution method for detecting skewed data according to claim 2, wherein the product of the relative skew rate and the amount of sampled data in the Big data set is calculated to obtain the skew threshold, and a value whose occurrence frequency in the Big data set and the Small data set exceeds the skew threshold is called a skew value;

4. The Hash Join execution method for detecting skewed data according to claim 3, wherein the checker first sequentially samples the Big data set, multiplies the amount of sampled data by the relative skew rate to obtain the skew threshold, and then checks whether a skew value exists in the sampled data.

5. The Hash Join execution method for detecting skewed data according to claim 3, wherein, according to the skew value list, all records with skew values in the Big data set are split into a Skew 1 data set and the remaining data in the Big data set into a Normal 1 data set, and at the same time all records with skew values in the Small data set are split into a Skew 2 data set and the remaining data in the Small data set into a Normal 2 data set;

6. The Hash Join execution method for detecting skewed data according to claim 5, wherein, when a hash router is used to perform hash distribution on the Normal 1 data set and the Normal 2 data set, the same hash function is applied to certain fields of both data sets, each data set is split into a plurality of subsets according to the calculation result, and the subsets of the Normal 1 data set and the subsets of the Normal 2 data set are distributed to the corresponding nodes of the Join node cluster.

7. The Hash Join execution method for detecting skewed data according to claim 6, wherein, when average distribution is performed on the Skew 1 data set, the Skew 1 data set is split into a plurality of subsets according to a certain field, and the subsets are then evenly distributed to all nodes of the Join node cluster.

8. The Hash Join execution method for detecting skewed data according to claim 1, wherein the relative skew rate is a value less than 1; the smaller the relative skew rate, the more skew values there are in the input data set, and conversely, the larger the relative skew rate, the fewer skew values there are.