CN106021386A

CN106021386A - Theta-join method for massive distributed data

Info

Publication number: CN106021386A
Application number: CN201610312145.4A
Authority: CN
Inventors: 刘文洁; 李占怀; 潘巍; 张晓�
Original assignee: Northwestern Polytechnical University
Current assignee: Yunyao Technology (Zhejiang) Co.,Ltd.
Priority date: 2016-05-12
Filing date: 2016-05-12
Publication date: 2016-10-12
Anticipated expiration: 2036-05-12
Also published as: CN106021386B

Abstract

The invention discloses a theta-join method for massive distributed data, which is used for solving the technical problem of low efficiency of the conventional theta-join method. The method adopts the technical scheme that prior to theta-join of two tables, appropriate filtering rules are firstly selected according to join conditions, the maximum values and the minimum values of join fields of the two tables are then calculated, all records in the two tables are scanned according to the maximum values and the minimum values, the records irrelevant to output results are removed through filtering, Cartesian product calculation is only carried out for filtered data, and a secondary comparison of Cartesian product results is finally carried out according to the join conditions, so that the records satisfying the join conditions are obtained through screening. The method has the advantages that a large number of the records which fail to satisfy the join conditions are removed through the filtering prior to the Cartesian product calculation, so that the workload of a Reducer is effectively reduced, and the theta-join query efficiency is improved.

Description

Theta-join method for massive distributed data

Technical field

The present invention relates to a theta-join (non-equi-join) method, and in particular to a theta-join method for massive distributed data.

Background art

In cloud computing environments, the explosive growth of data volume poses new challenges for storing, processing, and analyzing data. Traditional databases and data processing methods cannot meet the storage and processing demands of big data. At present, the mainstream approach is to use MapReduce parallel processing to increase data processing speed. Under the MapReduce parallel distributed model, data can be partitioned and processed in a distributed manner. However, join operations, and in particular the Cartesian product produced by a theta-join (non-equi-join), sharply increase the volume of data on the network and on disk, causing heavy I/O and disk overhead. Processing efficiency is therefore very low.

The document "Alper Okcan. Processing theta-joins using MapReduce [C]. Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, P946-960, ACM, 2011" discloses an efficient method for processing a theta-join of two tables in a distributed environment, called 1-bucket-theta. The method uses a join matrix as the theta-join model: the Cartesian product of the two tables is represented as a join matrix, and a randomized algorithm spreads all cells of the matrix evenly across the nodes of the cluster, so that the workload of each node is roughly equal. This ensures load balancing among all nodes during parallel computation and improves join processing efficiency. However, the Cartesian product is the full set from which any join result is drawn, and for a theta-join not every cell produces a final output result. For queries with low selectivity, a large number of nodes therefore perform useless work that produces no output, so the efficiency of the algorithm is very low.
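To make the selectivity argument concrete, the following minimal sketch counts how many cells of a join matrix actually produce output for a given predicate. The function name `join_matrix_selectivity` and the sample key ranges are assumptions chosen for illustration; they do not come from the cited document.

```python
import operator


def join_matrix_selectivity(r_keys, s_keys, theta):
    """Return (matching_cells, total_cells) for the join matrix r_keys x s_keys.

    Each cell (x, y) of the matrix is one candidate pair of the Cartesian
    product; a cell is "productive" only if theta(x, y) holds.
    """
    total = len(r_keys) * len(s_keys)
    matching = sum(1 for x in r_keys for y in s_keys if theta(x, y))
    return matching, total


# Hypothetical join-column values (not from the source document).
r_keys = list(range(1, 101))    # R.B takes the values 1..100
s_keys = list(range(90, 111))   # S.B takes the values 90..110

matching, total = join_matrix_selectivity(r_keys, s_keys, operator.gt)
print(f"{matching} of {total} cells satisfy R.B > S.B")
```

With low-selectivity inputs such as these, most cells are evaluated without contributing to the output, which is the inefficiency described above.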

Summary of the invention

To overcome the low efficiency of existing theta-join methods, the present invention provides a theta-join method for massive distributed data. Before performing the theta-join of two tables, the method first selects an appropriate filtering rule according to the join condition, then computes the maximum and minimum values of the join fields of the two tables, scans all records in the two tables against these maximum and minimum values, and filters out the records that are irrelevant to the output result. The Cartesian product is computed only on the filtered data, and finally the Cartesian product result is compared a second time against the join condition to select the records that satisfy it. Because a large number of records that do not satisfy the join condition are filtered out before the Cartesian product is computed, the workload of the Reducer is effectively reduced and the query efficiency of the theta-join is improved.

The technical solution adopted by the present invention to solve this technical problem is a theta-join method for massive distributed data, characterized by comprising the following steps:

Step one: assume there are two relation tables R(A, B) and S(B, C) and a join operator θ ∈ {>, <, ≥, ≤}. The query Q_B is defined as R(A, B) ⋈_{R.B θ S.B} S(B, C); Q_B is then called a theta-join query between relation tables R and S on field B.

Symbol description: R(A, B) denotes relation table R, where A and B are attributes of R; S(B, C) denotes relation table S, where B and C are attributes of S. θ is the join operator between R and S. Q_B denotes a query involving R and S, and "⋈" denotes the join symbol. Because data are computed in key-value form in a distributed environment, field B is treated as the key of R and S, field A is treated as the combination of all fields of relation table R other than B, and field C is treated as the combination of all fields of relation table S other than B. The join of multi-field tables is thereby converted into the key-value form used in a distributed computing environment.

Step two: assume R(A, B) and S(B, C) are the two tables to be joined, the join attribute is B, and the join operator is θ, with θ ∈ {>, <, ≤, ≥}. Let the maximum and minimum values of the R.B records be R.Bmax and R.Bmin, respectively, and the maximum and minimum values of the S.B records be S.Bmax and S.Bmin, respectively. The following theorems are then obtained:

Theorem 1: for two given tables R(A, B) and S(B, C) with join attribute B and θ = ">", for every x ∈ R.B, if x > S.Bmin, retain x; otherwise filter this x out of attribute B of relation table R. For every y ∈ S.B, if y < R.Bmax, retain y; otherwise filter this y out of attribute B of relation table S.

Symbol description: S.Bmin denotes the minimum over all records of attribute column B in relation table S, and R.Bmax denotes the maximum over all records of attribute column B in relation table R. x denotes an arbitrary value of attribute column B in relation table R, and y denotes an arbitrary value of attribute column B in relation table S.

Theorem 2: for two given tables R(A, B) and S(B, C) with join attribute B and θ = "<", for every x ∈ R.B, if x < S.Bmax, retain x; otherwise filter this x out of attribute B of relation table R. For every y ∈ S.B, if y > R.Bmin, retain y; otherwise filter this y out of attribute B of relation table S.

Symbol description: S.Bmax denotes the maximum over all records of attribute column B in relation table S, and R.Bmin denotes the minimum over all records of attribute column B in relation table R.

Theorem 3: for two given tables R(A, B) and S(B, C) with join attribute B and θ = "≥", for every x ∈ R.B, if x ≥ S.Bmin, retain x; otherwise filter this x out of attribute B of relation table R. For every y ∈ S.B, if y ≤ R.Bmax, retain y; otherwise filter this y out of attribute B of relation table S.

Theorem 4: for two given tables R(A, B) and S(B, C) with join attribute B and θ = "≤", for every x ∈ R.B, if x ≤ S.Bmax, retain x; otherwise filter this x out of attribute B of relation table R. For every y ∈ S.B, if y ≥ R.Bmin, retain y; otherwise filter this y out of attribute B of relation table S.
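The four filtering rules can be sketched as a small lookup table of keep-predicates. This is a minimal illustration, assuming the join columns fit in memory as Python lists; the names `FILTER_RULES` and `filter_join_columns` are hypothetical and do not appear in the source.

```python
# Keep-predicates per operator, following Theorems 1 to 4.
# Each entry is (keep_x, keep_y):
#   keep_x(x, s_min, s_max) decides whether a value x of R.B is retained,
#   keep_y(y, r_min, r_max) decides whether a value y of S.B is retained.
FILTER_RULES = {
    ">":  (lambda x, s_min, s_max: x > s_min,
           lambda y, r_min, r_max: y < r_max),    # Theorem 1
    "<":  (lambda x, s_min, s_max: x < s_max,
           lambda y, r_min, r_max: y > r_min),    # Theorem 2
    ">=": (lambda x, s_min, s_max: x >= s_min,
           lambda y, r_min, r_max: y <= r_max),   # Theorem 3
    "<=": (lambda x, s_min, s_max: x <= s_max,
           lambda y, r_min, r_max: y >= r_min),   # Theorem 4
}


def filter_join_columns(r_b, s_b, theta):
    """Apply the min/max filtering rule for `theta` to both join columns."""
    if not r_b or not s_b:
        return [], []  # an empty side means the join result is empty
    keep_x, keep_y = FILTER_RULES[theta]
    r_min, r_max = min(r_b), max(r_b)
    s_min, s_max = min(s_b), max(s_b)
    r_filtered = [x for x in r_b if keep_x(x, s_min, s_max)]
    s_filtered = [y for y in s_b if keep_y(y, r_min, r_max)]
    return r_filtered, s_filtered


# Hypothetical column values, used only to exercise the rules.
print(filter_join_columns([1, 2, 3, 4, 5], [3, 5, 5], ">"))
```

Keeping the rules in one table makes it easy to see that each operator compares R.B against one bound of S.B and S.B against the opposite bound of R.B.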

Step three: theta-join algorithm:

Algorithm input: the two data tables to be joined, R(A, B) and S(B, C); the join attribute B; the join operator θ.

Algorithm output: the result set T(R.A, R.B, S.B, S.C) satisfying the theta-join condition; every record in T satisfies the condition R.B θ S.B.

Algorithm flow:

Step1: extract the join column R.B from data table R(A, B), sort R.B, traverse R.B, and find the maximum R.Bmax and minimum R.Bmin of the record set; extract the join column S.B from data table S(B, C), sort S.B, traverse S.B, and find the maximum S.Bmax and minimum S.Bmin of the record set;

Step2: examine the join condition θ. If θ = ">", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 1 to obtain the filtered data sets R'(A, B) and S'(B, C);

Symbol description: R'(A, B) denotes a subset of relation table R(A, B) after filtering, and S'(B, C) denotes a subset of relation table S(B, C) after filtering.

If θ = "<", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 2 to obtain the filtered data sets R'(A, B) and S'(B, C);

If θ = "≥", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 3 to obtain the filtered data sets R'(A, B) and S'(B, C);

If θ = "≤", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 4 to obtain the filtered data sets R'(A, B) and S'(B, C);

Step3: perform distributed computation on the filtered R'(A, B) and S'(B, C) to obtain the Cartesian product result set T'(R.A, R.B, S.B, S.C);

Symbol description: T'(R.A, R.B, S.B, S.C) denotes the result set obtained by computing the Cartesian product of R'(A, B) and S'(B, C). The Cartesian product here is taken over the full attribute columns, so all attribute columns of R and S are included.

Step4: according to the join condition θ, perform a second screening on R.B and S.B in T'(R.A, R.B, S.B, S.C), delete the records that do not satisfy the condition, and obtain the final join result set T(R.A, R.B, S.B, S.C);

Symbol description: T(R.A, R.B, S.B, S.C) denotes the result set obtained after the second screening of the Cartesian product T'(R.A, R.B, S.B, S.C). Because the filtering rules cannot guarantee that all records failing the join condition are removed, the second screening is performed here to guarantee the correctness of the result.

Step5: output the theta-join result set T(R.A, R.B, S.B, S.C).
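Steps 1 to 5 can be sketched on a single machine as follows. This is a minimal illustration rather than the distributed implementation: rows are assumed to be in-memory tuples, the function name `theta_join` and the sample rows are hypothetical, and `min()`/`max()` stand in for the sort-and-traverse of Step1.

```python
import operator
from itertools import product

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def theta_join(r_rows, s_rows, theta):
    """Join R(A, B) rows (a, b) with S(B, C) rows (b, c) on R.B theta S.B.

    Returns a list of (R.A, R.B, S.B, S.C) tuples.
    """
    if not r_rows or not s_rows:
        return []
    cmp = OPS[theta]

    # Step1: find the extremes of both join columns.
    # (Simplification: the text sorts each column first; min/max suffice here.)
    r_b = [b for _, b in r_rows]
    s_b = [b for b, _ in s_rows]
    r_min, r_max = min(r_b), max(r_b)
    s_min, s_max = min(s_b), max(s_b)

    # Step2: apply the filtering rule matching theta (Theorems 1 to 4).
    if theta in (">", ">="):
        r_f = [(a, b) for a, b in r_rows if cmp(b, s_min)]   # x theta S.Bmin
        s_f = [(b, c) for b, c in s_rows if cmp(r_max, b)]   # R.Bmax theta y
    else:
        r_f = [(a, b) for a, b in r_rows if cmp(b, s_max)]   # x theta S.Bmax
        s_f = [(b, c) for b, c in s_rows if cmp(r_min, b)]   # R.Bmin theta y

    # Step3: Cartesian product of the filtered tables, T'.
    t_prime = [(a, rb, sb, c) for (a, rb), (sb, c) in product(r_f, s_f)]

    # Step4: second screening against the join condition, giving T.
    t = [row for row in t_prime if cmp(row[1], row[2])]

    # Step5: output the result set.
    return t


# Hypothetical rows; only the B values matter for the join condition.
r_rows = [("a1", 1), ("a2", 2), ("a3", 3), ("a4", 4), ("a5", 5)]
s_rows = [(3, "c1"), (5, "c2"), (5, "c3")]
print(theta_join(r_rows, s_rows, ">"))
```

Because every rule compares a value against a bound on the opposite table using θ itself, the two branches cover all four operators without a separate case for each.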

The beneficial effects of the present invention are as follows: before performing the theta-join of two tables, the method first selects an appropriate filtering rule according to the join condition, then computes the maximum and minimum values of the join fields of the two tables, scans all records in the two tables against these maximum and minimum values, and filters out the records that are irrelevant to the output result. The Cartesian product is computed only on the filtered data, and finally the Cartesian product result is compared a second time against the join condition to select the records that satisfy it. Because a large number of records that do not satisfy the join condition are filtered out before the Cartesian product is computed, the workload of the Reducer is significantly reduced and the query efficiency of the theta-join is improved.

Refer to Fig. 1. The beneficial effects of the present invention are illustrated below by an example.

Assume two data tables R(A, B) and S(B, C), with join attribute B and join condition ">". The SQL statement of the query is:

SELECT R.B, S.B
FROM R, S
WHERE R.B > S.B;

R has 5 records and S has 3 records, so the Cartesian product of the original R and S contains 5*3 = 15 records. First find the maximum and minimum of R.B, which are 5 and 1, respectively, and the maximum and minimum of S.B, which are 5 and 3, respectively. Since the join operator θ is ">", according to Theorem 1 each record x in R.B is tested against (x > S.Bmin), i.e., x > 3; retaining all records that satisfy this condition gives the subset R'.B = {4, 5} of R.B. Likewise, each record y in S.B is tested against (y < R.Bmax), i.e., y < 5, giving the subset S'.B = {3} of S.B. The Cartesian product of R'.B and S'.B contains 2*1 = 2 records, namely {(4, 3), (5, 3)}. After the second screening, every record in this result set satisfies the join condition, and the result set is output. With the method of the invention, the cost of computing the Cartesian product falls from 15 records to 2, an improvement of (15-2)/15 ≈ 86.7%.

In a real distributed cluster of 16 nodes, data sets of 1 GB to 100 GB were generated with TPC-H, and the method of the present invention was compared with the method of the cited document on various theta-join queries. The method of the present invention clearly outperforms the method of the document, speeding up queries by about 40 to 60 times.
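The arithmetic of this example can be reproduced with a few lines of Python. The text gives only the record counts and the extremes of each column, so the concrete values of `r_b` and `s_b` below are assumptions chosen to be consistent with those figures and with the stated result.

```python
# Hypothetical column contents: 5 records with min 1 / max 5 for R.B,
# and 3 records with min 3 / max 5 for S.B (actual values are not listed
# in the text).
r_b = [1, 2, 3, 4, 5]
s_b = [3, 5, 5]

# Size of the unfiltered Cartesian product.
full_pairs = len(r_b) * len(s_b)

# Theorem 1 (theta = ">"): keep x > S.Bmin and y < R.Bmax.
r_filtered = [x for x in r_b if x > min(s_b)]
s_filtered = [y for y in s_b if y < max(r_b)]

# Size of the Cartesian product after filtering.
filtered_pairs = len(r_filtered) * len(s_filtered)

# Second screening against the join condition R.B > S.B.
result = [(x, y) for x in r_filtered for y in s_filtered if x > y]

# Relative reduction in the number of pairs that must be compared.
reduction = (full_pairs - filtered_pairs) / full_pairs

print(full_pairs, filtered_pairs, result, f"{reduction:.1%}")
```

The reduction is measured in candidate pairs, which is the quantity the Reducer has to materialize and compare.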

The present invention is described in detail below with reference to the accompanying drawing and a specific embodiment.

Brief description of the drawing

Fig. 1 is an example diagram of a theta-join query performed in a distributed environment by the theta-join method for massive distributed data of the present invention. The tables to be queried are R(A, B) and S(B, C), the join attribute column is B, and the join condition is R.B > S.B. The expected output records are (4, 3) and (5, 3). In table R, the maximum of attribute B is 5 and the minimum is 1. In table S, the maximum of attribute B is 5 and the minimum is 3.

Detailed description of the invention

The specific steps of the theta-join method for massive distributed data of the present invention are as follows:

A. Theta-join query definition:

Assume there are two relation tables R(A, B) and S(B, C) and a join operator θ ∈ {>, <, ≥, ≤}. The query Q_B is defined as R(A, B) ⋈_{R.B θ S.B} S(B, C); Q_B is then called a theta-join query between relation tables R and S on field B.

Symbol description: R(A, B) denotes relation table R, where A and B are attributes of R; S(B, C) denotes relation table S, where B and C are attributes of S. θ is the join operator between R and S. Q_B denotes a query involving R and S, and "⋈" denotes the join symbol. Because data are computed in key-value form in a distributed environment, field B can be treated as the key of R and S, field A can be treated as the combination (value) of all fields of relation table R other than B, and field C can be treated as the combination (value) of all fields of relation table S other than B. The join of multi-field tables can thereby be converted into the key-value form used in a distributed computing environment.

For example, the relation table R(A, B) to be queried contains attributes A and B, and the relation table S(B, C) contains attributes B and C. Let the join operator be θ = {>} and the theta-join column be B. The query Q_B is defined as R(A, B) ⋈_{R.B > S.B} S(B, C). The query is required to output only the join columns (R.B, S.B) of the two tables.
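The key-value view of a multi-field table can be sketched as follows. This is an illustration only: rows are assumed to be Python dictionaries, and the function name `to_key_value` and the field names `A1`, `A2` are hypothetical.

```python
def to_key_value(row, key_field="B"):
    """Split a row into (key, value): the join field becomes the key and
    all remaining fields are bundled together as the value."""
    key = row[key_field]
    value = {name: v for name, v in row.items() if name != key_field}
    return key, value


# A hypothetical R row with two non-join fields; together they play the
# role of the single attribute A in R(A, B).
r_row = {"A1": "alice", "A2": 10, "B": 4}
print(to_key_value(r_row))
```

Under this view the filtering rules only ever inspect the key, so the value can be carried along untouched until the Cartesian product is formed.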

B. Filtering rules based on maximum and minimum values:

To illustrate the data filtering method based on maximum and minimum values, we give an example of a theta-join query in a distributed environment to describe the idea of the method.

Assume R(A, B) and S(B, C) are the two tables to be joined, the join attribute is B, and the join operator is θ, with θ ∈ {>, <, ≤, ≥}. Let the maximum and minimum values of the R.B records be R.Bmax and R.Bmin, respectively, and the maximum and minimum values of the S.B records be S.Bmax and S.Bmin, respectively. The following theorems are then obtained:

Continuing the example above, since the join operator is θ = {>}, Theorem 1 is selected as the filtering rule.

Theorems 1 to 4 are the filtering rules applied before the two tables are joined; which one applies depends on the specific join operator θ. To demonstrate the correctness of the theorems, we prove Theorem 1; the proofs of Theorems 2 to 4 are analogous.

Proof (Theorem 1): if x ∈ R.B and x ≤ S.B_min, then for every y ∈ S.B we have x ≤ y, which contradicts the join condition ">" (x > y). Such an x cannot produce any result record satisfying the join condition, so it should be filtered out. Similarly, if y ∈ S.B and y ≥ R.B_max, then for every x ∈ R.B we have y ≥ x (i.e., x ≤ y), which also contradicts the join condition ">" (x > y), so such a y should be filtered out as well.
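The claim that filtering never discards a record that contributes to the result can also be checked empirically. The sketch below compares a naive join with a filtered join on random integer columns for all four operators; it is a sanity check under assumed inputs, not a substitute for the proof, and the function names are hypothetical.

```python
import operator
import random

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def naive_join(r_b, s_b, theta):
    """Theta-join by testing every pair of the full Cartesian product."""
    cmp = OPS[theta]
    return sorted((x, y) for x in r_b for y in s_b if cmp(x, y))


def filtered_join(r_b, s_b, theta):
    """Theta-join after applying the min/max rules of Theorems 1 to 4."""
    if not r_b or not s_b:
        return []
    cmp = OPS[theta]
    if theta in (">", ">="):
        r_f = [x for x in r_b if cmp(x, min(s_b))]
        s_f = [y for y in s_b if cmp(max(r_b), y)]
    else:
        r_f = [x for x in r_b if cmp(x, max(s_b))]
        s_f = [y for y in s_b if cmp(min(r_b), y)]
    return sorted((x, y) for x in r_f for y in s_f if cmp(x, y))


def check_filters_are_lossless(trials=200, seed=0):
    """Return True if both joins agree on every random trial."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    for _ in range(trials):
        r_b = [rng.randint(0, 20) for _ in range(rng.randint(1, 8))]
        s_b = [rng.randint(0, 20) for _ in range(rng.randint(1, 8))]
        for theta in OPS:
            if naive_join(r_b, s_b, theta) != filtered_join(r_b, s_b, theta):
                return False
    return True


print(check_filters_are_lossless())
```

The check passes because a value is removed only when it fails the join condition against every value on the other side, which is exactly the argument of the proof.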

C. Theta-join algorithm:

Algorithm output: the result set T(R.A, R.B, S.B, S.C) satisfying the theta-join condition; every record in T satisfies the condition R.B θ S.B (for example, R.B > S.B).

Algorithm flow:

Step1: extract the join column R.B from data table R(A, B), sort R.B, traverse R.B, and find the maximum R.Bmax = 5 and minimum R.Bmin = 1 of the record set; extract the join column S.B from data table S(B, C), sort S.B, traverse S.B, and find the maximum S.Bmax = 5 and minimum S.Bmin = 3 of the record set;

Step2: because θ = ">", filter R.B and S.B according to the filtering rule of Theorem 1 (since the query requires only column B in the output, only B is operated on, which avoids computation on irrelevant columns). The filter conditions are R.B > 3 (S.Bmin) and S.B < 5 (R.Bmax), giving the filtered data sets R'.B = {4, 5} and S'.B = {3};

Step3: perform distributed computation on the filtered R'.B and S'.B to obtain the Cartesian product result set T'(R.B, S.B) = {(4, 3), (5, 3)};

Step4: according to the join condition θ, perform a second screening on R.B and S.B in T'(R.A, R.B, S.B, S.C), delete the records that do not satisfy the condition, and obtain the final join result set T(R.A, R.B, S.B, S.C). Here, the records in T'(R.B, S.B) = {(4, 3), (5, 3)} are screened a second time by checking whether each record satisfies the condition (R.B > S.B), giving the final result set T(R.B, S.B) = {(4, 3), (5, 3)};

Step5: output the theta-join result set T(R.B, S.B) = {(4, 3), (5, 3)}.
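One way to picture how the filter and the Cartesian product could be split between map and reduce phases is the in-process simulation below. The text does not specify how records are assigned to reducers, so the partitioning used here (each surviving R record goes to one reducer, each surviving S record is replicated to all reducers) is an assumption made for illustration, as are the names `mapper`, `reducer`, `run_job`, and the parameter `num_reducers`.

```python
import operator
from collections import defaultdict

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def mapper(tag, key, theta, bounds, num_reducers):
    """Emit (reducer_id, (tag, key)) for records that pass the min/max filter."""
    cmp = OPS[theta]
    r_min, r_max, s_min, s_max = bounds
    lower_bound_rule = theta in (">", ">=")
    if tag == "R":
        keep = cmp(key, s_min) if lower_bound_rule else cmp(key, s_max)
        if keep:
            # Assumed partitioning: one reducer per R record.
            yield hash(key) % num_reducers, (tag, key)
    else:
        keep = cmp(r_max, key) if lower_bound_rule else cmp(r_min, key)
        if keep:
            # Assumed partitioning: replicate S records to every reducer.
            for reducer_id in range(num_reducers):
                yield reducer_id, (tag, key)


def reducer(records, theta):
    """Local Cartesian product plus the second screening."""
    cmp = OPS[theta]
    r_keys = [k for tag, k in records if tag == "R"]
    s_keys = [k for tag, k in records if tag == "S"]
    return [(x, y) for x in r_keys for y in s_keys if cmp(x, y)]


def run_job(r_b, s_b, theta, num_reducers=2):
    """Simulate map, shuffle, and reduce in a single process."""
    if not r_b or not s_b:
        return []
    bounds = (min(r_b), max(r_b), min(s_b), max(s_b))
    shuffled = defaultdict(list)
    for tag, keys in (("R", r_b), ("S", s_b)):
        for key in keys:
            for reducer_id, record in mapper(tag, key, theta, bounds, num_reducers):
                shuffled[reducer_id].append(record)
    output = []
    for reducer_id in sorted(shuffled):
        output.extend(reducer(shuffled[reducer_id], theta))
    return sorted(output)


# Hypothetical column values consistent with the extremes given in the text.
print(run_job([1, 2, 3, 4, 5], [3, 5, 5], ">"))
```

In this arrangement the filter runs in the map phase, so records that cannot contribute to the result are never shuffled to a reducer.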

Claims

1. A theta-join method for massive distributed data, characterized by comprising the following steps:

Step one: assume there are two relation tables R(A, B) and S(B, C) and a join operator θ ∈ {>, <, ≥, ≤}; the query Q_B is defined as R(A, B) ⋈_{R.B θ S.B} S(B, C), and Q_B is then called a theta-join query between relation tables R and S on field B;

Symbol description: R(A, B) denotes relation table R, where A and B are attributes of R; S(B, C) denotes relation table S, where B and C are attributes of S; θ is the join operator between R and S; Q_B denotes a query involving R and S, and "⋈" denotes the join symbol; because data are computed in key-value form in a distributed environment, field B is treated as the key of R and S, field A is treated as the combination of all fields of relation table R other than B, and field C is treated as the combination of all fields of relation table S other than B; the join of multi-field tables is thereby converted into the key-value form used in a distributed computing environment;

Theorem 1: for two given tables R(A, B) and S(B, C) with join attribute B and θ = ">", for every x ∈ R.B, if x > S.Bmin, retain x, otherwise filter this x out of attribute B of relation table R; for every y ∈ S.B, if y < R.Bmax, retain y, otherwise filter this y out of attribute B of relation table S;

Symbol description: S.Bmin denotes the minimum over all records of attribute column B in relation table S, and R.Bmax denotes the maximum over all records of attribute column B in relation table R; x denotes an arbitrary value of attribute column B in relation table R, and y denotes an arbitrary value of attribute column B in relation table S;

Theorem 2: for two given tables R(A, B) and S(B, C) with join attribute B and θ = "<", for every x ∈ R.B, if x < S.Bmax, retain x, otherwise filter this x out of attribute B of relation table R; for every y ∈ S.B, if y > R.Bmin, retain y, otherwise filter this y out of attribute B of relation table S;

Symbol description: S.Bmax denotes the maximum over all records of attribute column B in relation table S, and R.Bmin denotes the minimum over all records of attribute column B in relation table R;

Theorem 3: for two given tables R(A, B) and S(B, C) with join attribute B and θ = "≥", for every x ∈ R.B, if x ≥ S.Bmin, retain x, otherwise filter this x out of attribute B of relation table R; for every y ∈ S.B, if y ≤ R.Bmax, retain y, otherwise filter this y out of attribute B of relation table S;

Theorem 4: for two given tables R(A, B) and S(B, C) with join attribute B and θ = "≤", for every x ∈ R.B, if x ≤ S.Bmax, retain x, otherwise filter this x out of attribute B of relation table R; for every y ∈ S.B, if y ≥ R.Bmin, retain y, otherwise filter this y out of attribute B of relation table S;

Step three: theta-join algorithm:

Algorithm input: the two data tables to be joined, R(A, B) and S(B, C); the join attribute B; the join operator θ;

Algorithm output: the result set T(R.A, R.B, S.B, S.C) satisfying the theta-join condition; every record in T satisfies the condition R.B θ S.B;

Algorithm flow:

Symbol description: R'(A, B) denotes a subset of relation table R(A, B) after filtering, and S'(B, C) denotes a subset of relation table S(B, C) after filtering;

Symbol description: T'(R.A, R.B, S.B, S.C) denotes the result set obtained by computing the Cartesian product of R'(A, B) and S'(B, C); the Cartesian product here is taken over the full attribute columns, so all attribute columns of R and S are included;

Symbol description: T(R.A, R.B, S.B, S.C) denotes the result set obtained after the second screening of the Cartesian product T'(R.A, R.B, S.B, S.C); because the filtering rules cannot guarantee that all records failing the join condition are removed, the second screening is performed here to guarantee the correctness of the result;

Step5: output the theta-join result set T(R.A, R.B, S.B, S.C).