Non-equi join method for massive distributed data
Technical field
The present invention relates to a non-equi join method, and in particular to a non-equi join (theta-join) method for massive distributed data.
Background art
In cloud computing environments, the explosive growth of data volume poses new challenges for data storage, processing, and analysis. Traditional databases and data processing methods cannot meet the storage and processing demands of big data. At present, the mainstream approach is to use MapReduce parallel processing to increase data processing speed. Under the MapReduce parallel distributed model, data can be sharded and processed in a distributed fashion. However, join operations, and in particular the cartesian product produced by a non-equi join (theta-join), sharply increase the volume of data moved over the network and written to disk. This causes very large I/O and disk overhead, so processing efficiency is very low.
The document "Alper Okcan. Processing theta-joins using mapreduce [C]. Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, P946-960, ACM, 2011" discloses an effective method for processing a non-equi join of two tables in a distributed environment, called 1-bucket-theta. The method uses a join matrix as the theta-join model: the cartesian product of the two tables is represented as a join matrix, and a randomized algorithm spreads all cells of the matrix evenly over the nodes of the cluster, so that the workload of each node is roughly equal. This guarantees load balancing across all nodes during parallel computation and improves join efficiency. However, the cartesian product is the superset of every join result, and for a non-equi join not every cell produces a final output record. For queries with low selectivity, many nodes therefore perform work that yields no output, and efficiency is very low.
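The wasted work described above can be made concrete by counting how many cells of a join matrix satisfy the join predicate. The following is a minimal sketch, not the 1-bucket-theta algorithm itself: the function name `productive_cells` and the sample join-column values are assumptions chosen only for illustration.

```python
import operator


def productive_cells(r_keys, s_keys, theta):
    """Count join-matrix cells (one per (x, y) pair) that satisfy the predicate."""
    total = len(r_keys) * len(s_keys)
    hits = sum(1 for x in r_keys for y in s_keys if theta(x, y))
    return hits, total


# Hypothetical join-column values, chosen only for illustration.
r_keys = [1, 2, 3, 4, 5]
s_keys = [3, 5, 5]

hits, total = productive_cells(r_keys, s_keys, operator.gt)
print(f"{hits} of {total} join-matrix cells produce output")
```

With these values only 2 of the 15 cells satisfy R.B > S.B, so a scheme that evaluates every cell spends most of its effort on pairs that contribute nothing.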
Summary of the invention
To overcome the low efficiency of existing non-equi join methods, the present invention provides a non-equi join method for massive distributed data. Before performing the non-equi join of two tables, the method first selects a suitable filtering rule according to the join condition, then computes the maximum and minimum of the join field of each table. All records of both tables are scanned against these maxima and minima, and records that cannot contribute to the output are filtered out. The cartesian product is computed only on the filtered data, and the result of the cartesian product is then checked a second time against the join condition to select the records that satisfy it. Because the method removes a large number of records that cannot satisfy the join condition before the cartesian product is computed, it effectively reduces the workload of the Reducer and improves the query efficiency of non-equi joins.
The technical solution adopted by the present invention to solve this technical problem is a non-equi join method for massive distributed data, characterized by the following steps:
Step one: Suppose there are two relation tables R(A, B) and S(B, C) and a join operator θ ∈ {>, <, ≥, ≤}. The query QB is defined as R(A, B) ⋈_{R.B θ S.B} S(B, C); QB is then called a non-equi join query between relation tables R and S on field B.
Symbol description: R(A, B) denotes relation table R, where A and B are the attributes of R; S(B, C) denotes relation table S, where B and C are the attributes of S. θ is the join operator between R and S. QB denotes a query involving R and S, and "⋈" denotes the join symbol. In a distributed environment data is processed in key-value form. Field B is therefore treated as the key of both R and S, field A as the combination of all fields of R other than B, and field C as the combination of all fields of S other than B. In this way a join of multi-field tables is converted into the key-value form used in a distributed computing environment.
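The key-value conversion described in step one can be sketched as follows. This is an illustrative assumption about how a row might be represented (a Python dict); the helper name `to_key_value` and the column names `A1` and `A2` are hypothetical and do not appear in the source.

```python
def to_key_value(row, key_field):
    """Split a row into (key, value): the join field is the key,
    and all remaining fields together form the value."""
    value = {name: v for name, v in row.items() if name != key_field}
    return row[key_field], value


# Hypothetical row of R with two non-join columns standing in for "A".
r_row = {"A1": "alice", "A2": 30, "B": 4}
print(to_key_value(r_row, "B"))  # (4, {'A1': 'alice', 'A2': 30})
```

Grouping every non-join column into a single value is what lets a table with many fields be handled as the two-column relation R(A, B).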
Step two: Suppose R(A, B) and S(B, C) are the two tables to be joined, the join attribute is B, and the join operator is θ with θ ∈ {>, <, ≤, ≥}. Let R.Bmax and R.Bmin be the maximum and minimum of the R.B records, and let S.Bmax and S.Bmin be the maximum and minimum of the S.B records. The following theorems then hold:
Theorem 1: For two given tables R(A, B) and S(B, C) with join attribute B and θ = ">": for any value x of attribute B in relation table R, keep x if x > S.Bmin and filter it out otherwise; for any value y of attribute B in relation table S, keep y if y < R.Bmax and filter it out otherwise.
Symbol description: S.Bmin denotes the minimum over all records of attribute column B in relation table S, and R.Bmax denotes the maximum over all records of attribute column B in relation table R. x denotes any value of attribute column B in relation table R, and y denotes any value of attribute column B in relation table S.
Theorem 2: For two given tables R(A, B) and S(B, C) with join attribute B and θ = "<": for any value x of attribute B in relation table R, keep x if x < S.Bmax and filter it out otherwise; for any value y of attribute B in relation table S, keep y if y > R.Bmin and filter it out otherwise.
Symbol description: S.Bmax denotes the maximum over all records of attribute column B in relation table S, and R.Bmin denotes the minimum over all records of attribute column B in relation table R.
Theorem 3: For two given tables R(A, B) and S(B, C) with join attribute B and θ = "≥": for any value x of attribute B in relation table R, keep x if x ≥ S.Bmin and filter it out otherwise; for any value y of attribute B in relation table S, keep y if y ≤ R.Bmax and filter it out otherwise.
Theorem 4: For two given tables R(A, B) and S(B, C) with join attribute B and θ = "≤": for any value x of attribute B in relation table R, keep x if x ≤ S.Bmax and filter it out otherwise; for any value y of attribute B in relation table S, keep y if y ≥ R.Bmin and filter it out otherwise.
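The four filtering rules differ only in which comparison is used and which bound of the other table it is compared against, so they can be written as one lookup table. The sketch below is one possible encoding under the assumption that both join columns are non-empty lists of comparable values; the names `RULES` and `prefilter` are hypothetical.

```python
import operator

# For each join operator:
#   (test for x in R.B, S bound it is compared with,
#    test for y in S.B, R bound it is compared with)
RULES = {
    ">":  (operator.gt, "min", operator.lt, "max"),  # Theorem 1
    "<":  (operator.lt, "max", operator.gt, "min"),  # Theorem 2
    ">=": (operator.ge, "min", operator.le, "max"),  # Theorem 3
    "<=": (operator.le, "max", operator.ge, "min"),  # Theorem 4
}


def prefilter(r_keys, s_keys, theta):
    """Apply the min/max filtering rule for theta to both join columns.
    Assumes r_keys and s_keys are non-empty."""
    r_test, s_bound, s_test, r_bound = RULES[theta]
    bound = {"min": min, "max": max}
    s_limit = bound[s_bound](s_keys)  # S.Bmin or S.Bmax
    r_limit = bound[r_bound](r_keys)  # R.Bmax or R.Bmin
    r_kept = [x for x in r_keys if r_test(x, s_limit)]
    s_kept = [y for y in s_keys if s_test(y, r_limit)]
    return r_kept, s_kept


# Hypothetical join-column values, chosen only for illustration.
print(prefilter([1, 2, 3, 4, 5], [3, 5, 5], ">"))  # ([4, 5], [3])
```

Note that both bounds are computed from the unfiltered columns, as the theorems require.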
Step three: Non-equi join algorithm:
Algorithm input: the two data tables R(A, B) and S(B, C) to be joined, the join attribute B, and the join operator θ.
Algorithm output: the result set T(R.A, R.B, S.B, S.C) that satisfies the non-equi join condition; every record in T satisfies the condition R.B θ S.B.
Algorithm flow:
Step1: Extract the join column R.B from data table R(A, B), sort R.B, traverse it, and find the maximum R.Bmax and minimum R.Bmin of the record set. Extract the join column S.B from data table S(B, C), sort S.B, traverse it, and find the maximum S.Bmax and minimum S.Bmin of the record set;
Step2: Check the join condition θ. If θ = ">", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 1 to obtain the filtered data sets R'(A, B) and S'(B, C);
Symbol description: R'(A, B) denotes the subset of relation table R(A, B) that remains after filtering, and S'(B, C) denotes the subset of relation table S(B, C) that remains after filtering.
If θ = "<", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 2 to obtain the filtered data sets R'(A, B) and S'(B, C);
If θ = "≥", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 3 to obtain the filtered data sets R'(A, B) and S'(B, C);
If θ = "≤", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 4 to obtain the filtered data sets R'(A, B) and S'(B, C);
Step3: Run the distributed computation on the filtered R'(A, B) and S'(B, C) to obtain the cartesian product result set T'(R.A, R.B, S.B, S.C);
Symbol description: T'(R.A, R.B, S.B, S.C) denotes the result set of the cartesian product of R'(A, B) and S'(B, C). The cartesian product is taken over all attribute columns, so all attribute columns of R and S are included.
Step4: According to the join condition θ, apply secondary screening to R.B and S.B in T'(R.A, R.B, S.B, S.C), delete the records that do not satisfy the condition, and obtain the final join result set T(R.A, R.B, S.B, S.C);
Symbol description: T(R.A, R.B, S.B, S.C) denotes the result set obtained by applying secondary screening to the cartesian product T'(R.A, R.B, S.B, S.C). The filtering rules cannot guarantee that every record failing the join condition has been removed, so secondary screening is applied here to guarantee the correctness of the result.
Step5: Output the non-equi join result set T(R.A, R.B, S.B, S.C).
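Step1 to Step5 can be sketched end to end on a single machine. This is a minimal illustration, not the distributed implementation: rows are assumed to be plain tuples `(a, b)` for R and `(b, c)` for S, the function name `theta_join` and the sample data are hypothetical, and the bounds are found with a linear scan because the sort in Step1 is not needed to obtain a minimum and maximum.

```python
import operator
from itertools import product

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def theta_join(r_rows, s_rows, theta):
    """r_rows: list of (a, b); s_rows: list of (b, c).
    Returns the tuples (a, r_b, s_b, c) with r_b theta s_b."""
    if not r_rows or not s_rows:
        return []
    op = OPS[theta]
    # Step1: bounds of each join column (linear scans, no sort required).
    r_bs = [b for _, b in r_rows]
    s_bs = [b for b, _ in s_rows]
    r_min, r_max = min(r_bs), max(r_bs)
    s_min, s_max = min(s_bs), max(s_bs)
    # Step2: pick the filtering rule that matches theta (Theorems 1 to 4).
    if theta in (">", ">="):
        r_f = [(a, b) for a, b in r_rows if op(b, s_min)]
        s_f = [(b, c) for b, c in s_rows if op(r_max, b)]
    else:
        r_f = [(a, b) for a, b in r_rows if op(b, s_max)]
        s_f = [(b, c) for b, c in s_rows if op(r_min, b)]
    # Step3: cartesian product of the filtered tables.
    t_prime = [(a, rb, sb, c) for (a, rb), (sb, c) in product(r_f, s_f)]
    # Step4: secondary screening against the actual join condition.
    return [t for t in t_prime if op(t[1], t[2])]


# Hypothetical data, chosen so that secondary screening has work to do.
r_rows = [("a1", 1), ("a2", 2), ("a3", 3), ("a4", 4), ("a5", 5)]
s_rows = [(3, "c1"), (4, "c2"), (5, "c3")]
result = theta_join(r_rows, s_rows, ">")  # Step5: output
print(result)
```

With this data the filtered product holds four tuples, one of which pairs R.B = 4 with S.B = 4 and is removed in Step4. This shows why the min/max filter alone is not sufficient and the secondary screening is required.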
The beneficial effects of the present invention are as follows. Before performing the non-equi join of two tables, the method first selects a suitable filtering rule according to the join condition, then computes the maximum and minimum of the join field of each table. All records of both tables are scanned against these maxima and minima, and records that cannot contribute to the output are filtered out. The cartesian product is computed only on the filtered data, and the result of the cartesian product is then checked a second time against the join condition to select the records that satisfy it. Because the method removes a large number of records that cannot satisfy the join condition before the cartesian product is computed, it significantly reduces the workload of the Reducer and improves the query efficiency of non-equi joins.
With reference to Fig. 1, the beneficial effects of the present invention are illustrated below with an example.
Suppose there are two data tables R(A, B) and S(B, C) with join attribute B and join condition ">". The SQL statement of the query is:
Select R.B,S.B
From R,S
Where R.B>S.B;
R has 5 records and S has 3 records, so the cartesian product of the original R and S contains 5 × 3 = 15 records. First, the maximum and minimum of R.B are found to be 5 and 1, and the maximum and minimum of S.B to be 5 and 3. Since the join operator θ is ">", Theorem 1 applies: for each record x of R.B, test x > S.Bmin, i.e. x > 3, and keep all records that satisfy the condition, which gives the subset R'.B = {4, 5}. Likewise, for each record y of S.B, test y < R.Bmax, i.e. y < 5, which gives the subset S'.B = {3}. The cartesian product of R'.B and S'.B contains 2 × 1 = 2 records, {(4, 3), (5, 3)}. After secondary screening, both records satisfy the join condition, and this result set is output. With the method of the present invention, the cost of the cartesian product falls from 15 records to 2, a reduction of (15 - 2)/15 ≈ 86.7%.
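The arithmetic of this example can be checked in a few lines. The record counts are the ones stated above; the helper name `candidate_pairs` is an assumption introduced only for this sketch.

```python
def candidate_pairs(r_count, s_count):
    """Number of record pairs a cartesian product has to generate."""
    return r_count * s_count


before = candidate_pairs(5, 3)  # unfiltered R and S
after = candidate_pairs(2, 1)   # R'.B = {4, 5}, S'.B = {3}
reduction = (before - after) / before
print(f"{before} -> {after} pairs, reduction = {reduction:.1%}")
```

The exact ratio is 13/15, which rounds to 86.7%.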
In a real distributed cluster of 16 nodes, data sets of 1 GB to 100 GB were generated with TPC-H, and the method of the present invention was compared with the method of the cited document on a variety of non-equi join queries. The method of the present invention clearly outperforms the method of the cited document, speeding up queries by a factor of about 40 to 60.
The present invention is described in detail below with reference to the accompanying drawing and a specific embodiment.
Brief description of the drawings
Fig. 1 is an example diagram of a non-equi join query carried out in a distributed environment by the non-equi join method for massive distributed data of the present invention. The tables to be queried are R(A, B) and S(B, C), the join attribute column is B, and the join condition is R.B > S.B. The expected output records are (4, 3) and (5, 3). In table R, the maximum of attribute B is 5 and the minimum is 1. In table S, the maximum of attribute B is 5 and the minimum is 3.
Detailed description of the embodiments
The specific steps of the non-equi join method for massive distributed data of the present invention are as follows:
A. Theta-join query definition:
Suppose there are two relation tables R(A, B) and S(B, C) and a join operator θ ∈ {>, <, ≥, ≤}. The query QB is defined as R(A, B) ⋈_{R.B θ S.B} S(B, C); QB is then called a non-equi join query between relation tables R and S on field B.
Symbol description: R(A, B) denotes relation table R, where A and B are the attributes of R; S(B, C) denotes relation table S, where B and C are the attributes of S. θ is the join operator between R and S. QB denotes a query involving R and S, and "⋈" denotes the join symbol. In a distributed environment data is processed in key-value form. Field B can therefore be treated as the key of both R and S, field A as the combination of all fields of R other than B (the value), and field C as the combination of all fields of S other than B (the value). In this way a join of multi-field tables can be converted into the key-value form used in a distributed computing environment.
For example, the relation table R(A, B) to be queried contains attributes A and B, and the relation table S(B, C) contains attributes B and C. The join operator is θ = ">" and the non-equi join column is B. The query QB is defined as R(A, B) ⋈_{R.B > S.B} S(B, C). The query is required to output only the join columns (R.B, S.B) of the two tables.
B. Filtering rules based on maximum and minimum:
To illustrate the data filtering method based on maximum and minimum values, we use an example of a non-equi join query in a distributed environment to describe the idea of the method.
Suppose R(A, B) and S(B, C) are the two tables to be joined, the join attribute is B, and the join operator is θ with θ ∈ {>, <, ≤, ≥}. Let R.Bmax and R.Bmin be the maximum and minimum of the R.B records, and let S.Bmax and S.Bmin be the maximum and minimum of the S.B records. The following theorems then hold:
Theorem 1: For two given tables R(A, B) and S(B, C) with join attribute B and θ = ">": for any value x of attribute B in relation table R, keep x if x > S.Bmin and filter it out otherwise; for any value y of attribute B in relation table S, keep y if y < R.Bmax and filter it out otherwise.
Symbol description: S.Bmin denotes the minimum over all records of attribute column B in relation table S, and R.Bmax denotes the maximum over all records of attribute column B in relation table R. x denotes any value of attribute column B in relation table R, and y denotes any value of attribute column B in relation table S.
Continuing the example: since the join operator is θ = ">", Theorem 1 is selected as the filtering rule.
Theorem 2: For two given tables R(A, B) and S(B, C) with join attribute B and θ = "<": for any value x of attribute B in relation table R, keep x if x < S.Bmax and filter it out otherwise; for any value y of attribute B in relation table S, keep y if y > R.Bmin and filter it out otherwise.
Symbol description: S.Bmax denotes the maximum over all records of attribute column B in relation table S, and R.Bmin denotes the minimum over all records of attribute column B in relation table R.
Theorem 3: For two given tables R(A, B) and S(B, C) with join attribute B and θ = "≥": for any value x of attribute B in relation table R, keep x if x ≥ S.Bmin and filter it out otherwise; for any value y of attribute B in relation table S, keep y if y ≤ R.Bmax and filter it out otherwise.
Theorem 4: For two given tables R(A, B) and S(B, C) with join attribute B and θ = "≤": for any value x of attribute B in relation table R, keep x if x ≤ S.Bmax and filter it out otherwise; for any value y of attribute B in relation table S, keep y if y ≥ R.Bmin and filter it out otherwise.
Theorems 1 to 4 are the filtering rules applied before the two tables are joined; which one is used depends on the specific join operator θ. To show that the theorems are correct, we prove Theorem 1; the proofs of Theorems 2 to 4 are analogous.
Proof (Theorem 1): If x ∈ R.B and x ≤ S.Bmin, then for every y ∈ S.B we have x ≤ S.Bmin ≤ y, i.e. x ≤ y. This contradicts the join condition ">" (x > y), so such an x cannot produce any result record that satisfies the join condition and should be filtered out. Similarly, if y ∈ S.B and y ≥ R.Bmax, then for every x ∈ R.B we have y ≥ R.Bmax ≥ x, i.e. x ≤ y. This also contradicts the join condition ">" (x > y), so such a y should be filtered out as well.
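The claim proved above, that the Theorem 1 filter never discards a record that contributes to the join, can also be checked empirically. The following is a small randomized test, not part of the source method; the helper names `join_gt` and `prefilter_gt`, the value range, and the fixed seed are assumptions made for this sketch.

```python
import random


def join_gt(r_keys, s_keys):
    """Brute-force join on x > y, returned as a sorted list of pairs."""
    return sorted((x, y) for x in r_keys for y in s_keys if x > y)


def prefilter_gt(r_keys, s_keys):
    """Theorem 1: keep x > S.Bmin and y < R.Bmax (bounds from unfiltered data)."""
    s_min, r_max = min(s_keys), max(r_keys)
    return [x for x in r_keys if x > s_min], [y for y in s_keys if y < r_max]


rng = random.Random(0)  # fixed seed so the run is repeatable
for _ in range(1000):
    r_keys = [rng.randint(0, 20) for _ in range(rng.randint(1, 8))]
    s_keys = [rng.randint(0, 20) for _ in range(rng.randint(1, 8))]
    r_kept, s_kept = prefilter_gt(r_keys, s_keys)
    # Joining the filtered columns must give exactly the same pairs.
    assert join_gt(r_kept, s_kept) == join_gt(r_keys, s_keys)
print("Theorem 1 filter preserved the join result in all trials")
```

A passing run is evidence rather than proof, but it exercises duplicates and boundary values that are easy to overlook when reading the argument.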
C. Non-equi join algorithm:
Algorithm input: the two data tables R(A, B) and S(B, C) to be joined, the join attribute B, and the join operator θ.
Algorithm output: the result set T(R.A, R.B, S.B, S.C) that satisfies the non-equi join condition; every record in T satisfies the condition R.B θ S.B (for example, R.B > S.B).
Algorithm flow:
Step1: Extract the join column R.B from data table R(A, B), sort R.B, traverse it, and find the maximum R.Bmax = 5 and minimum R.Bmin = 1 of the record set. Extract the join column S.B from data table S(B, C), sort S.B, traverse it, and find the maximum S.Bmax = 5 and minimum S.Bmin = 3 of the record set;
Step2: Because θ = ">", R.B and S.B are filtered according to the filtering rule of Theorem 1 (the query outputs only column B, so only B is processed, which avoids computation on unrelated columns). The filter conditions are R.B > 3 (S.Bmin) and S.B < 5 (R.Bmax), which give the filtered data sets R'.B = {4, 5} and S'.B = {3};
Symbol description: R'(A, B) denotes the subset of relation table R(A, B) that remains after filtering, and S'(B, C) denotes the subset of relation table S(B, C) that remains after filtering.
If θ = "<", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 2 to obtain the filtered data sets R'(A, B) and S'(B, C);
If θ = "≥", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 3 to obtain the filtered data sets R'(A, B) and S'(B, C);
If θ = "≤", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 4 to obtain the filtered data sets R'(A, B) and S'(B, C);
Step3: Run the distributed computation on the filtered R'.B and S'.B to obtain the cartesian product result set T'(R.B, S.B) = {(4, 3), (5, 3)};
Symbol description: T'(R.A, R.B, S.B, S.C) denotes the result set of the cartesian product of R'(A, B) and S'(B, C). In the general case the cartesian product is taken over all attribute columns, so all attribute columns of R and S are included.
Step4: According to the join condition θ, apply secondary screening to R.B and S.B in T'(R.A, R.B, S.B, S.C), delete the records that do not satisfy the condition, and obtain the final join result set T(R.A, R.B, S.B, S.C). In the example, secondary screening is applied to the records of T'(R.B, S.B) = {(4, 3), (5, 3)}: each record is checked against the condition R.B > S.B, which gives the final result set T(R.B, S.B) = {(4, 3), (5, 3)};
Symbol description: T(R.A, R.B, S.B, S.C) denotes the result set obtained by applying secondary screening to the cartesian product T'(R.A, R.B, S.B, S.C). The filtering rules cannot guarantee that every record failing the join condition has been removed, so secondary screening is applied here to guarantee the correctness of the result.
Step5: Output the non-equi join result set T(R.B, S.B) = {(4, 3), (5, 3)}.
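The source does not specify how the filtered cartesian product in Step3 is divided among the nodes. One simple possibility, sketched below purely as an assumption, is to partition R'.B round-robin across workers and send all of S'.B to each of them; the workers are simulated sequentially here, and the names `split` and `worker` are hypothetical.

```python
from itertools import product


def split(items, n_workers):
    """Round-robin partition of items into n_workers chunks."""
    return [items[i::n_workers] for i in range(n_workers)]


def worker(r_chunk, s_all):
    """One simulated reducer: cross product of its R' chunk with all of S'
    (Step3), followed by secondary screening on R.B > S.B (Step4)."""
    return [(x, y) for x, y in product(r_chunk, s_all) if x > y]


r_filtered = [4, 5]  # R'.B from the example
s_filtered = [3]     # S'.B from the example

partials = [worker(chunk, s_filtered) for chunk in split(r_filtered, 2)]
result = sorted(pair for part in partials for pair in part)
print(result)  # [(4, 3), (5, 3)]
```

Because each pair (x, y) is generated by exactly one worker, the union of the partial results equals the result of the single-machine computation. Replicating S' to every worker is only reasonable when the filtered S' is small, which is the situation the min/max filter aims to create.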