Non-equivalent connection method towards magnanimity distributed data
Technical field
The present invention relates to a kind of non-equivalent connection method, in particular to a kind of non-equivalent towards magnanimity distributed data connects
Connect method.
Background technique
Under cloud computing environment, the explosive growth of data volume brings new challenge to data storage, processing and analysis.
Traditional database and data processing method is unable to satisfy the storage and processing demand of big data, currently, the processing method of mainstream
It is the processing speed that data are improved using the parallel processing technique of MapReduce.In the parallel distributed based on MapReduce
Under model, although fragment, distributed treatment can be carried out to data, due to attended operation, especially non-equivalent attended operation
(Theta-join) the cartesian product result generated will lead to the data volume in network and disk and sharply increase, and cause very big
I/O and disk expense, therefore, treatment effeciency is very low.
Document " Alper Okcan.Processing theta-joins using mapreduce [C]
.Proceedings of the2011ACM SIGMOD International Conference on Management of
Data, P946-960, ACM, 2011 " disclose it is a kind of two tables are handled under distributed environment non-equivalent connection have efficacious prescriptions
Method, referred to as 1-bucket-theta, this method by connection matrix (join-Matrix) be used as Theta-join link model, two
The cartesian product of table is expressed as connection matrix, and all cells are evenly dispersed to cluster by random algorithm on matrix
In each node in, the workload of each node is of substantially equal.It is all during this method ensure that parallel computation
Load balancing between node improves the treatment effeciency of connection.But since cartesian product is the complete or collected works of join algorithm, for
For non-equivalent connection, and not all cell can all generate the final result of output, therefore for selectivity
(selectivity) for small inquiry, which will have a large amount of node and carry out non-productive work, not generate output knot
Fruit, therefore efficiency is very low.
Summary of the invention
In order to overcome the shortcomings of that existing non-equivalent connection method low efficiency, the present invention provide one kind towards magnanimity distribution number
According to non-equivalent connection method.For this method before carrying out the non-equivalent connection of two tables, it is suitable to be selected first according to condition of contact
Filtering rule, the maximum value and minimum value of two table linkage fields are then calculated, according to maximum value and minimum value in two tables
All records are scanned, and the record unrelated with output result is filtered, and only carry out Descartes to filtered data
Long-pending calculating finally carries out secondary comparison according to result of the condition of contact to cartesian product, filters out the note for meeting condition of contact
Record.This method can filter out the record for not meeting condition of contact largely, can be effectively reduced before doing cartesian product
The workload of Reducer improves the search efficiency of non-equivalent connection.
A kind of the technical solution adopted by the present invention to solve the technical problems: non-equivalent company towards magnanimity distributed data
Method is connect, its main feature is that the following steps are included:
Step 1: assume there are two relation table R (A, B) and S (B, C), function # belong to>,<,>=,≤, inquiry QB is defined
For R (A, B) ∞R.BθS.BS (B, C), then QB be known as the non-equivalent connected between relation table R and S by field B connection look into
It askes.
Symbol description: R (A, B) representation relation table R, A, B are the attributes of R, and S (B, C) representation relation table S, B, C are the categories of S
Property.θ is the contiguous function of R and S.QB represents one and is related to the inquiry of R and S, and " ∞ " represents bound symbol.Due in distributed ring
Under border, the computational format of data is key-value form, and therefore, field B regards that the key of R and S, field A regard relationship as
The combination of all fields in table R other than B, field C regard the combination of all fields in relation table S other than B as.
To which the connection of multiword segment table is converted to the key-value form in distributed computing environment.
Step 2: assume that R (A, B) and S (B, C) are two tables to be connected, connection attribute is B, and contiguous function is θ, and θ
Maximum value and minimum value in ∈ { >, < ,≤, >=}, R.B record are R.Bmax and R.Bmin, the maximum in S.B record respectively
Value and minimum value are S.Bmax and S.Bmin respectively, then obtain following theorem:
Theorem 1: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" > ", then
If x > S.Bmin, retain x, otherwise filters out the x from the attribute B of relation table R;If y < R.Bmax,
Retain y and otherwise filters out the y from the attribute B of relation table S.
Symbol description: the minimum value in all records of the attribute column B in S.Bmin representation relation table S, R.Bmax are represented
Maximum value in all records of attribute column B in relation table R.The arbitrary value of attribute column B in x representation relation table R, y are represented
The arbitrary value of attribute column B in relation table S.
Theorem 2: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" < ", then
If x < S.Bmax, retain x, otherwise filters out the x from the attribute B of relation table R;If y > R.Bmin,
Retain y and otherwise filters out the y from the attribute B of relation table S.
Symbol description: the maximum value in all records of the attribute column B in S.Bmax representation relation table S, R.Bmin are represented
Minimum value in all records of attribute column B in relation table R.
Theorem 3: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" >=", then
If x >=S.Bmin, retain x, otherwise filters out the x from the attribute B of relation table R;If y≤R.Bmax,
Then retain y and otherwise filters out the y from the attribute B of relation table S.
Theorem 4: for two specified table R (A, B) and S (B, C), connection attribute B, θ="≤", thenIf x≤S.Bmax, retain x, otherwise filters out the x from the attribute B of relation table R;If y
>=R.Bmin then retains y and otherwise filters out the y from the attribute B of relation table S.
Step 3: non-equivalent join algorithm:
Algorithm input: two tables of data R (A, B) to be connected and S (B, C), connection attribute B, contiguous function θ.
Algorithm output: meet the result set T (R.A, R.B, S.B, S.C) of non-equivalent condition of contact, for any note in T
Record, all meets condition R.B θ S.B.
Algorithm flow:
Step1: it from connection column R.B is taken out in tables of data R (A, B), sorts to R.B, traverses R.B, find record set most
Big value R.Bmax and minimum value R.Bmin;From connection column S.B is taken out in tables of data S (B, C), sorts to S.B, traverse S.B, find
The maximum value S.Bmax and minimum value S.Bmin of record set;
Step2: judging condition of contact θ, if θ=" > ", according to the filtering rule of theorem 1 to R (A, B) and S (B, C) into
Row filtering, obtains filtered data set R'(A, B) and S'(B, C);
Symbol description: R'(A, B) representation relation table R (A, B) filtered a subset, S'(B, C) representation relation table S
(B, C) filtered a subset.
If θ=" < ", R (A, B) and S (B, C) are filtered according to the filtering rule of theorem 2, obtain filtered number
According to collection R'(A, B) and S'(B, C);
If θ=" >=", R (A, B) and S (B, C) are filtered according to the filtering rule of theorem 3, obtained filtered
Data set R'(A, B) and S'(B, C);
If θ="≤", R (A, B) and S (B, C) are filtered according to the filtering rule of theorem 4, obtained filtered
Data set R'(A, B) and S'(B, C);
Step3: filtered R'(A, B) and S'(B, C) is subjected to distributed computing, obtains cartesian product result set T'
(R.A,R.B,S.B,S.C);
Symbol description: T'(R.A, R.B, S.B, S.C) represent to R'(A, B) and S'(B, C) carry out cartesian product calculating after
Result set, the cartesian product of the full attribute column carried out here, thus R and S all properties column be included.
Step4: according to condition of contact θ, to T'(R.A, R.B, S.B, S.C) in R.B and S.B carry out postsearch screening, delete
Except ineligible record, final connection result collection T (R.A, R.B, S.B, S.C) is obtained;
Symbol description: T (R.A, R.B, S.B, S.C) indicate to cartesian product T'(R.A, R.B, S.B, S.C) carry out it is secondary
Result set after screening, because filtering rule cannot be guaranteed to screen out all records for not meeting condition of contact, in order to guarantee to tie
The correctness of fruit carries out postsearch screening here.
Step5: output non-equivalent result set T (R.A, R.B, S.B, S.C).
The beneficial effects of the present invention are: this method carry out two tables non-equivalent connection before, first according to condition of contact
Suitable filtering rule is selected, the maximum value and minimum value of two table linkage fields are then calculated, according to maximum value and minimum value pair
All records in two tables are scanned, and will be filtered with the unrelated record of output result, only to filtered data into
The calculating of row cartesian product finally carries out secondary comparison according to result of the condition of contact to cartesian product, filters out and meet connection
The record of condition.This method can filter out before doing cartesian product and largely not meet the record of condition of contact, to have
The workload for reducing to effect Reducer, improves the search efficiency of non-equivalent connection.
Referring to Fig.1.Illustrate beneficial effects of the present invention below by an example.
Assuming that two tables of data R (A, B) and S (B, C), connection attribute B, condition of contact is " > ", the SQL statement of inquiry are as follows:
Select R.B,S.B
From R,S
Where R.B>S.B;
R has 5 records, and S has 3 records, and the result set of the cartesian product of original R and S is 5*3=15 item record.It is first
The maximum value and minimum value for first finding R.B, are 5 and 1 respectively;The maximum value and minimum value of S.B, is 5 and 3 respectively.Due to connection
Function # is " > ", therefore any record x in R.B is judged (x > S.Bmin), i.e. x > 3, retains all symbols according to theorem 1
Close the record of the condition, a subset R'.B={ 4,5 } of available R.B.Likewise, sentencing for any record y of S.B
Disconnected (y < R.Bmax), i.e. y < 5, obtain a subset S'.B={ 3 } of S.B.Cartesian product, result set are carried out to R'.B and S'.B
It is { (4,3), (5,3) } respectively for 2*1=2 item record, by postsearch screening, which is the record for meeting condition of contact,
Export the result set.Improvement by the method for the invention, the calculating cost of cartesian product are reduced to 2 records from 15, improve
Ratio has reached (15-2)/15=86%.
Under the distributed type assemblies environment of true 16 node, 1G-100G data set is generated with TPC-H, using this hair
The method of bright method and citation of the present invention compares various non-equivalent Connection inquirings, and the method for the present invention is obviously excellent
In literature method, search efficiency can accelerate 40-60 times or so.
It elaborates with reference to the accompanying drawings and detailed description to the present invention.
Detailed description of the invention
Fig. 1 is that the non-equivalent connection method of the invention towards magnanimity distributed data carries out non-equivalent under distributed environment
The exemplary diagram of Connection inquiring.Table to be checked is R (A, B) and S (B, C), and connection attribute is classified as B, condition of contact R.B > S.B.
Expected output is recorded as (4,3), (5,3).The maximum value of attribute B in R table is 5, and minimum value is 1.Attribute B is most in S table
Big value is 5, and minimum value is 3.
Specific embodiment
Specific step is as follows for non-equivalent connection method of the present invention towards magnanimity distributed data:
A, Theta-Join is query-defined:
Assuming that there are two relation table R (A, B) and S (B, C), function # belong to>,<,>=,≤, inquire QB be defined as R (A,
B)∞R.BθS.BS (B, C), then QB is known as the non-equivalent Connection inquiring connected between relation table R and S by field B.
Symbol description: R (A, B) representation relation table R, A, B are the attributes of R, and S (B, C) representation relation table S, B, C are the categories of S
Property.θ is the contiguous function of R and S.QB represents one and is related to the inquiry of R and S, and " ∞ " represents bound symbol.Due in distributed ring
Under border, the computational format of data is key-value form, and therefore, field B can be regarded as the key of R and S, and field A can be seen
Work is the combination (value) of all fields in relation table R other than B, field C can be regarded as in relation table S in addition to B with
The combination (value) of outer all fields.So that the connection of multiword segment table can be converted in distributed computing environment
Key-value form.
For example, relation table R (A, B) to be checked includes attribute A, B, relation table S (B, C) includes attribute B, C.Downlink connection letter
Number θ={ > }, non-equivalent connection are classified as B.Inquire QBIt is defined as R (A, B) ∞R.B > S.BS(B,C).It is required that the output column of inquiry are only
Export the connection column (R.B, S.B) of two tables.
B, the filtering rule based on maximum value and minimum value:
In order to illustrate the data filtering method based on maximum value and minimum value, we provide one under distributed environment into
The example of row non-equivalent Connection inquiring describes the thought of this method.
Assuming that R (A, B) and S (B, C) are two tables to be connected, connection attribute is B, and contiguous function is θ, and θ ∈ >,
< ,≤, >=, maximum value and minimum value in R.B record are R.Bmax and R.Bmin respectively, maximum value in S.B record and most
Small value is S.Bmax and S.Bmin respectively, then our available following theorems:
Theorem 1: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" > ", then
If x > S.Bmin, retain x, otherwise filters out the x from the attribute B of relation table R;If y < R.Bmax,
Retain y and otherwise filters out the y from the attribute B of relation table S.
Symbol description: the minimum value in all records of the attribute column B in S.Bmin representation relation table S, R.Bmax are represented
Maximum value in all records of attribute column B in relation table R.The arbitrary value of attribute column B in x representation relation table R, y are represented
The arbitrary value of attribute column B in relation table S.
Example is connected, due to contiguous function θ={ > }, Selection theorem 1 is used as filtering rule.
Theorem 1: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" > ", then
If x > S.Bmin, retain x, otherwise filters out the x from the attribute B of relation table R;If y < R.Bmax,
Retain y and otherwise filters out the y from the attribute B of relation table S.
Theorem 2: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" < ", then
If x < S.Bmax, retain x, otherwise filters out the x from the attribute B of relation table R;If y > R.Bmin,
Retain y and otherwise filters out the y from the attribute B of relation table S.
Symbol description: the maximum value in all records of the attribute column B in S.Bmax representation relation table S, R.Bmin are represented
Minimum value in all records of attribute column B in relation table R.
Theorem 3: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" >=", then
If x >=S.Bmin, retain x, otherwise filters out the x from the attribute B of relation table R;If y≤R.Bmax,
Then retain y and otherwise filters out the y from the attribute B of relation table S.
Theorem 4: for two specified table R (A, B) and S (B, C), connection attribute B, θ="≤", then
If x≤S.Bmax, retain x, otherwise filters out the x from the attribute B of relation table R;If y >=R.Bmin,
Then retain y and otherwise filters out the y from the attribute B of relation table S.
Theorem 1 to theorem 4 be two tables connection before filtering rule, it is different according to specific contiguous function θ, in order to
Illustrate the correctness of theorem, we prove theorem 1, and the proof of theorem 2-4 is the same as theorem 1.
It proves (theorem 1): ifAnd x≤S.Bmin, then forWe can all obtain x≤
Y, this and condition of contact " > " (x > y) be it is contradictory, such x will not generate the result record for meeting condition of contact,
Therefore the x should be filtered.It is similar, ifAnd y >=R.Bmax, then forWe will
Y >=x (i.e. x≤y) is obtained, this is also contradictory with condition of contact " > " (x > y), therefore such y should also be as being filtered.
C, non-equivalent join algorithm:
Algorithm input: two tables of data R (A, B) to be connected and S (B, C), connection attribute B, contiguous function θ.
Algorithm output: meet the result set T (R.A, R.B, S.B, S.C) of non-equivalent condition of contact, for any note in T
Record, all meets condition R.B θ S.B (for example, R.B > S.B).
Algorithm flow:
Step1: it from connection column R.B is taken out in tables of data R (A, B), sorts to R.B, traverses R.B, find record set most
Big value R.Bmax=5 and minimum value R.Bmin=1;From connection column S.B is taken out in tables of data S (B, C), sort to S.B, traversal
S.B finds the maximum value S.Bmax=5 and minimum value S.Bmin=3 of record set;
Step2: because of θ=" > ", R.B and S.B are filtered (because query requirement is defeated according to the filtering rule of theorem 1
Fall out as B, therefore only B operated, reduce the calculating of unrelated column), filter condition R.B>3 (S.Bmin) and S.B<5
(R.Bmax), filtered data set R'.B={ 4,5 } and S'.B={ 3 } are obtained;
Symbol description: R'(A, B) representation relation table R (A, B) filtered a subset, S'(B, C) representation relation table S
(B, C) filtered a subset.
If θ=" < ", R (A, B) and S (B, C) are filtered according to the filtering rule of theorem 2, obtain filtered number
According to collection R'(A, B) and S'(B, C);
If θ=" >=", R (A, B) and S (B, C) are filtered according to the filtering rule of theorem 3, obtained filtered
Data set R'(A, B) and S'(B, C);
If θ="≤", R (A, B) and S (B, C) are filtered according to the filtering rule of theorem 4, obtained filtered
Data set R'(A, B) and S'(B, C);
Step3: carrying out distributed computing for filtered R'.B and S'.B, obtain cartesian product result set T'(R.B,
S.B)={ (4,3), (5,3) };
Symbol description: T'(R.A, R.B, S.B, S.C) represent to R'(A, B) and S'(B, C) carry out cartesian product calculating after
Result set, the cartesian product of the full attribute column carried out here, thus R and S all properties column be included.
Step4: according to condition of contact θ, to T'(R.A, R.B, S.B, S.C) in R.B and S.B carry out postsearch screening, delete
Except ineligible record, final connection result collection T (R.A, R.B, S.B, S.C) is obtained;To T'(R.B, S.B)=(4,
3), (5,3) } in record carry out postsearch screening, interpretation it is whether each record meet (R.B > S.B) condition, most terminated
Fruit collects T (R.B, S.B)={ (4,3), (5,3) };
Symbol description: T (R.A, R.B, S.B, S.C) indicate to cartesian product T'(R.A, R.B, S.B, S.C) carry out it is secondary
Result set after screening, because filtering rule cannot be guaranteed to screen out all records for not meeting condition of contact, in order to guarantee to tie
The correctness of fruit carries out postsearch screening here.
Step5: output non-equivalent result set T (R.B, S.B)={ (4,3), (5,3) }.