CN106021386B - Non-equivalent connection method towards magnanimity distributed data - Google Patents

Non-equivalent connection method towards magnanimity distributed data Download PDF

Info

Publication number
CN106021386B
CN106021386B CN201610312145.4A CN201610312145A CN106021386B CN 106021386 B CN106021386 B CN 106021386B CN 201610312145 A CN201610312145 A CN 201610312145A CN 106021386 B CN106021386 B CN 106021386B
Authority
CN
China
Prior art keywords
attribute
relation table
connection
record
filtered
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610312145.4A
Other languages
Chinese (zh)
Other versions
CN106021386A (en
Inventor
刘文洁
李战怀
潘巍
张晓�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunyao Technology (Zhejiang) Co.,Ltd.
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN201610312145.4A priority Critical patent/CN106021386B/en
Publication of CN106021386A publication Critical patent/CN106021386A/en
Application granted granted Critical
Publication of CN106021386B publication Critical patent/CN106021386B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases

Abstract

The invention discloses a kind of non-equivalent connection method towards magnanimity distributed data, the technical issues of for solving existing non-equivalent connection method low efficiency.Technical solution is before carrying out the non-equivalent connection of two tables, suitable filtering rule is selected according to condition of contact first, then the maximum value and minimum value of two table linkage fields are calculated, all records in two tables are scanned according to maximum value and minimum value, the record unrelated with output result is filtered, only filtered data are carried out with the calculating of cartesian product, secondary comparison is finally carried out according to result of the condition of contact to cartesian product, filters out the record for meeting condition of contact.This method can filter out the record for not meeting condition of contact largely, can be effectively reduced the workload of Reducer before doing cartesian product, improve the search efficiency of non-equivalent connection.

Description

Non-equivalent connection method towards magnanimity distributed data
Technical field
The present invention relates to a kind of non-equivalent connection method, in particular to a kind of non-equivalent towards magnanimity distributed data connects Connect method.
Background technique
Under cloud computing environment, the explosive growth of data volume brings new challenge to data storage, processing and analysis. Traditional database and data processing method is unable to satisfy the storage and processing demand of big data, currently, the processing method of mainstream It is the processing speed that data are improved using the parallel processing technique of MapReduce.In the parallel distributed based on MapReduce Under model, although fragment, distributed treatment can be carried out to data, due to attended operation, especially non-equivalent attended operation (Theta-join) the cartesian product result generated will lead to the data volume in network and disk and sharply increase, and cause very big I/O and disk expense, therefore, treatment effeciency is very low.
Document " Alper Okcan.Processing theta-joins using mapreduce [C] .Proceedings of the2011ACM SIGMOD International Conference on Management of Data, P946-960, ACM, 2011 " disclose it is a kind of two tables are handled under distributed environment non-equivalent connection have efficacious prescriptions Method, referred to as 1-bucket-theta, this method by connection matrix (join-Matrix) be used as Theta-join link model, two The cartesian product of table is expressed as connection matrix, and all cells are evenly dispersed to cluster by random algorithm on matrix In each node in, the workload of each node is of substantially equal.It is all during this method ensure that parallel computation Load balancing between node improves the treatment effeciency of connection.But since cartesian product is the complete or collected works of join algorithm, for For non-equivalent connection, and not all cell can all generate the final result of output, therefore for selectivity (selectivity) for small inquiry, which will have a large amount of node and carry out non-productive work, not generate output knot Fruit, therefore efficiency is very low.
Summary of the invention
In order to overcome the shortcomings of that existing non-equivalent connection method low efficiency, the present invention provide one kind towards magnanimity distribution number According to non-equivalent connection method.For this method before carrying out the non-equivalent connection of two tables, it is suitable to be selected first according to condition of contact Filtering rule, the maximum value and minimum value of two table linkage fields are then calculated, according to maximum value and minimum value in two tables All records are scanned, and the record unrelated with output result is filtered, and only carry out Descartes to filtered data Long-pending calculating finally carries out secondary comparison according to result of the condition of contact to cartesian product, filters out the note for meeting condition of contact Record.This method can filter out the record for not meeting condition of contact largely, can be effectively reduced before doing cartesian product The workload of Reducer improves the search efficiency of non-equivalent connection.
A kind of the technical solution adopted by the present invention to solve the technical problems: non-equivalent company towards magnanimity distributed data Method is connect, its main feature is that the following steps are included:
Step 1: assume there are two relation table R (A, B) and S (B, C), function # belong to>,<,>=,≤, inquiry QB is defined For R (A, B) ∞R.BθS.BS (B, C), then QB be known as the non-equivalent connected between relation table R and S by field B connection look into It askes.
Symbol description: R (A, B) representation relation table R, A, B are the attributes of R, and S (B, C) representation relation table S, B, C are the categories of S Property.θ is the contiguous function of R and S.QB represents one and is related to the inquiry of R and S, and " ∞ " represents bound symbol.Due in distributed ring Under border, the computational format of data is key-value form, and therefore, field B regards that the key of R and S, field A regard relationship as The combination of all fields in table R other than B, field C regard the combination of all fields in relation table S other than B as. To which the connection of multiword segment table is converted to the key-value form in distributed computing environment.
Step 2: assume that R (A, B) and S (B, C) are two tables to be connected, connection attribute is B, and contiguous function is θ, and θ Maximum value and minimum value in ∈ { >, < ,≤, >=}, R.B record are R.Bmax and R.Bmin, the maximum in S.B record respectively Value and minimum value are S.Bmax and S.Bmin respectively, then obtain following theorem:
Theorem 1: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" > ", then If x > S.Bmin, retain x, otherwise filters out the x from the attribute B of relation table R;If y < R.Bmax, Retain y and otherwise filters out the y from the attribute B of relation table S.
Symbol description: the minimum value in all records of the attribute column B in S.Bmin representation relation table S, R.Bmax are represented Maximum value in all records of attribute column B in relation table R.The arbitrary value of attribute column B in x representation relation table R, y are represented The arbitrary value of attribute column B in relation table S.
Theorem 2: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" < ", then If x < S.Bmax, retain x, otherwise filters out the x from the attribute B of relation table R;If y > R.Bmin, Retain y and otherwise filters out the y from the attribute B of relation table S.
Symbol description: the maximum value in all records of the attribute column B in S.Bmax representation relation table S, R.Bmin are represented Minimum value in all records of attribute column B in relation table R.
Theorem 3: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" >=", then If x >=S.Bmin, retain x, otherwise filters out the x from the attribute B of relation table R;If y≤R.Bmax, Then retain y and otherwise filters out the y from the attribute B of relation table S.
Theorem 4: for two specified table R (A, B) and S (B, C), connection attribute B, θ="≤", thenIf x≤S.Bmax, retain x, otherwise filters out the x from the attribute B of relation table R;If y >=R.Bmin then retains y and otherwise filters out the y from the attribute B of relation table S.
Step 3: non-equivalent join algorithm:
Algorithm input: two tables of data R (A, B) to be connected and S (B, C), connection attribute B, contiguous function θ.
Algorithm output: meet the result set T (R.A, R.B, S.B, S.C) of non-equivalent condition of contact, for any note in T Record, all meets condition R.B θ S.B.
Algorithm flow:
Step1: it from connection column R.B is taken out in tables of data R (A, B), sorts to R.B, traverses R.B, find record set most Big value R.Bmax and minimum value R.Bmin;From connection column S.B is taken out in tables of data S (B, C), sorts to S.B, traverse S.B, find The maximum value S.Bmax and minimum value S.Bmin of record set;
Step2: judging condition of contact θ, if θ=" > ", according to the filtering rule of theorem 1 to R (A, B) and S (B, C) into Row filtering, obtains filtered data set R'(A, B) and S'(B, C);
Symbol description: R'(A, B) representation relation table R (A, B) filtered a subset, S'(B, C) representation relation table S (B, C) filtered a subset.
If θ=" < ", R (A, B) and S (B, C) are filtered according to the filtering rule of theorem 2, obtain filtered number According to collection R'(A, B) and S'(B, C);
If θ=" >=", R (A, B) and S (B, C) are filtered according to the filtering rule of theorem 3, obtained filtered Data set R'(A, B) and S'(B, C);
If θ="≤", R (A, B) and S (B, C) are filtered according to the filtering rule of theorem 4, obtained filtered Data set R'(A, B) and S'(B, C);
Step3: filtered R'(A, B) and S'(B, C) is subjected to distributed computing, obtains cartesian product result set T' (R.A,R.B,S.B,S.C);
Symbol description: T'(R.A, R.B, S.B, S.C) represent to R'(A, B) and S'(B, C) carry out cartesian product calculating after Result set, the cartesian product of the full attribute column carried out here, thus R and S all properties column be included.
Step4: according to condition of contact θ, to T'(R.A, R.B, S.B, S.C) in R.B and S.B carry out postsearch screening, delete Except ineligible record, final connection result collection T (R.A, R.B, S.B, S.C) is obtained;
Symbol description: T (R.A, R.B, S.B, S.C) indicate to cartesian product T'(R.A, R.B, S.B, S.C) carry out it is secondary Result set after screening, because filtering rule cannot be guaranteed to screen out all records for not meeting condition of contact, in order to guarantee to tie The correctness of fruit carries out postsearch screening here.
Step5: output non-equivalent result set T (R.A, R.B, S.B, S.C).
The beneficial effects of the present invention are: this method carry out two tables non-equivalent connection before, first according to condition of contact Suitable filtering rule is selected, the maximum value and minimum value of two table linkage fields are then calculated, according to maximum value and minimum value pair All records in two tables are scanned, and will be filtered with the unrelated record of output result, only to filtered data into The calculating of row cartesian product finally carries out secondary comparison according to result of the condition of contact to cartesian product, filters out and meet connection The record of condition.This method can filter out before doing cartesian product and largely not meet the record of condition of contact, to have The workload for reducing to effect Reducer, improves the search efficiency of non-equivalent connection.
Referring to Fig.1.Illustrate beneficial effects of the present invention below by an example.
Assuming that two tables of data R (A, B) and S (B, C), connection attribute B, condition of contact is " > ", the SQL statement of inquiry are as follows:
Select R.B,S.B
From R,S
Where R.B>S.B;
R has 5 records, and S has 3 records, and the result set of the cartesian product of original R and S is 5*3=15 item record.It is first The maximum value and minimum value for first finding R.B, are 5 and 1 respectively;The maximum value and minimum value of S.B, is 5 and 3 respectively.Due to connection Function # is " > ", therefore any record x in R.B is judged (x > S.Bmin), i.e. x > 3, retains all symbols according to theorem 1 Close the record of the condition, a subset R'.B={ 4,5 } of available R.B.Likewise, sentencing for any record y of S.B Disconnected (y < R.Bmax), i.e. y < 5, obtain a subset S'.B={ 3 } of S.B.Cartesian product, result set are carried out to R'.B and S'.B It is { (4,3), (5,3) } respectively for 2*1=2 item record, by postsearch screening, which is the record for meeting condition of contact, Export the result set.Improvement by the method for the invention, the calculating cost of cartesian product are reduced to 2 records from 15, improve Ratio has reached (15-2)/15=86%.
Under the distributed type assemblies environment of true 16 node, 1G-100G data set is generated with TPC-H, using this hair The method of bright method and citation of the present invention compares various non-equivalent Connection inquirings, and the method for the present invention is obviously excellent In literature method, search efficiency can accelerate 40-60 times or so.
It elaborates with reference to the accompanying drawings and detailed description to the present invention.
Detailed description of the invention
Fig. 1 is that the non-equivalent connection method of the invention towards magnanimity distributed data carries out non-equivalent under distributed environment The exemplary diagram of Connection inquiring.Table to be checked is R (A, B) and S (B, C), and connection attribute is classified as B, condition of contact R.B > S.B. Expected output is recorded as (4,3), (5,3).The maximum value of attribute B in R table is 5, and minimum value is 1.Attribute B is most in S table Big value is 5, and minimum value is 3.
Specific embodiment
Specific step is as follows for non-equivalent connection method of the present invention towards magnanimity distributed data:
A, Theta-Join is query-defined:
Assuming that there are two relation table R (A, B) and S (B, C), function # belong to>,<,>=,≤, inquire QB be defined as R (A, B)∞R.BθS.BS (B, C), then QB is known as the non-equivalent Connection inquiring connected between relation table R and S by field B.
Symbol description: R (A, B) representation relation table R, A, B are the attributes of R, and S (B, C) representation relation table S, B, C are the categories of S Property.θ is the contiguous function of R and S.QB represents one and is related to the inquiry of R and S, and " ∞ " represents bound symbol.Due in distributed ring Under border, the computational format of data is key-value form, and therefore, field B can be regarded as the key of R and S, and field A can be seen Work is the combination (value) of all fields in relation table R other than B, field C can be regarded as in relation table S in addition to B with The combination (value) of outer all fields.So that the connection of multiword segment table can be converted in distributed computing environment Key-value form.
For example, relation table R (A, B) to be checked includes attribute A, B, relation table S (B, C) includes attribute B, C.Downlink connection letter Number θ={ > }, non-equivalent connection are classified as B.Inquire QBIt is defined as R (A, B) ∞R.B > S.BS(B,C).It is required that the output column of inquiry are only Export the connection column (R.B, S.B) of two tables.
B, the filtering rule based on maximum value and minimum value:
In order to illustrate the data filtering method based on maximum value and minimum value, we provide one under distributed environment into The example of row non-equivalent Connection inquiring describes the thought of this method.
Assuming that R (A, B) and S (B, C) are two tables to be connected, connection attribute is B, and contiguous function is θ, and θ ∈ >, < ,≤, >=, maximum value and minimum value in R.B record are R.Bmax and R.Bmin respectively, maximum value in S.B record and most Small value is S.Bmax and S.Bmin respectively, then our available following theorems:
Theorem 1: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" > ", then If x > S.Bmin, retain x, otherwise filters out the x from the attribute B of relation table R;If y < R.Bmax, Retain y and otherwise filters out the y from the attribute B of relation table S.
Symbol description: the minimum value in all records of the attribute column B in S.Bmin representation relation table S, R.Bmax are represented Maximum value in all records of attribute column B in relation table R.The arbitrary value of attribute column B in x representation relation table R, y are represented The arbitrary value of attribute column B in relation table S.
Example is connected, due to contiguous function θ={ > }, Selection theorem 1 is used as filtering rule.
Theorem 1: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" > ", then If x > S.Bmin, retain x, otherwise filters out the x from the attribute B of relation table R;If y < R.Bmax, Retain y and otherwise filters out the y from the attribute B of relation table S.
Theorem 2: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" < ", then If x < S.Bmax, retain x, otherwise filters out the x from the attribute B of relation table R;If y > R.Bmin, Retain y and otherwise filters out the y from the attribute B of relation table S.
Symbol description: the maximum value in all records of the attribute column B in S.Bmax representation relation table S, R.Bmin are represented Minimum value in all records of attribute column B in relation table R.
Theorem 3: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" >=", then If x >=S.Bmin, retain x, otherwise filters out the x from the attribute B of relation table R;If y≤R.Bmax, Then retain y and otherwise filters out the y from the attribute B of relation table S.
Theorem 4: for two specified table R (A, B) and S (B, C), connection attribute B, θ="≤", then If x≤S.Bmax, retain x, otherwise filters out the x from the attribute B of relation table R;If y >=R.Bmin, Then retain y and otherwise filters out the y from the attribute B of relation table S.
Theorem 1 to theorem 4 be two tables connection before filtering rule, it is different according to specific contiguous function θ, in order to Illustrate the correctness of theorem, we prove theorem 1, and the proof of theorem 2-4 is the same as theorem 1.
It proves (theorem 1): ifAnd x≤S.Bmin, then forWe can all obtain x≤ Y, this and condition of contact " > " (x > y) be it is contradictory, such x will not generate the result record for meeting condition of contact, Therefore the x should be filtered.It is similar, ifAnd y >=R.Bmax, then forWe will Y >=x (i.e. x≤y) is obtained, this is also contradictory with condition of contact " > " (x > y), therefore such y should also be as being filtered.
C, non-equivalent join algorithm:
Algorithm input: two tables of data R (A, B) to be connected and S (B, C), connection attribute B, contiguous function θ.
Algorithm output: meet the result set T (R.A, R.B, S.B, S.C) of non-equivalent condition of contact, for any note in T Record, all meets condition R.B θ S.B (for example, R.B > S.B).
Algorithm flow:
Step1: it from connection column R.B is taken out in tables of data R (A, B), sorts to R.B, traverses R.B, find record set most Big value R.Bmax=5 and minimum value R.Bmin=1;From connection column S.B is taken out in tables of data S (B, C), sort to S.B, traversal S.B finds the maximum value S.Bmax=5 and minimum value S.Bmin=3 of record set;
Step2: because of θ=" > ", R.B and S.B are filtered (because query requirement is defeated according to the filtering rule of theorem 1 Fall out as B, therefore only B operated, reduce the calculating of unrelated column), filter condition R.B>3 (S.Bmin) and S.B<5 (R.Bmax), filtered data set R'.B={ 4,5 } and S'.B={ 3 } are obtained;
Symbol description: R'(A, B) representation relation table R (A, B) filtered a subset, S'(B, C) representation relation table S (B, C) filtered a subset.
If θ=" < ", R (A, B) and S (B, C) are filtered according to the filtering rule of theorem 2, obtain filtered number According to collection R'(A, B) and S'(B, C);
If θ=" >=", R (A, B) and S (B, C) are filtered according to the filtering rule of theorem 3, obtained filtered Data set R'(A, B) and S'(B, C);
If θ="≤", R (A, B) and S (B, C) are filtered according to the filtering rule of theorem 4, obtained filtered Data set R'(A, B) and S'(B, C);
Step3: carrying out distributed computing for filtered R'.B and S'.B, obtain cartesian product result set T'(R.B, S.B)={ (4,3), (5,3) };
Symbol description: T'(R.A, R.B, S.B, S.C) represent to R'(A, B) and S'(B, C) carry out cartesian product calculating after Result set, the cartesian product of the full attribute column carried out here, thus R and S all properties column be included.
Step4: according to condition of contact θ, to T'(R.A, R.B, S.B, S.C) in R.B and S.B carry out postsearch screening, delete Except ineligible record, final connection result collection T (R.A, R.B, S.B, S.C) is obtained;To T'(R.B, S.B)=(4, 3), (5,3) } in record carry out postsearch screening, interpretation it is whether each record meet (R.B > S.B) condition, most terminated Fruit collects T (R.B, S.B)={ (4,3), (5,3) };
Symbol description: T (R.A, R.B, S.B, S.C) indicate to cartesian product T'(R.A, R.B, S.B, S.C) carry out it is secondary Result set after screening, because filtering rule cannot be guaranteed to screen out all records for not meeting condition of contact, in order to guarantee to tie The correctness of fruit carries out postsearch screening here.
Step5: output non-equivalent result set T (R.B, S.B)={ (4,3), (5,3) }.

Claims (1)

1. a kind of non-equivalent connection method towards magnanimity distributed data, it is characterised in that the following steps are included:
Step 1: assume there are two relation table R (A, B) and S (B, C), function # belong to>,<,>=,≤, inquire QB and be defined as R (A,B)∞R.BθS.BS (B, C), then QB is known as the non-equivalent Connection inquiring connected between relation table R and S by field B;
Symbol description: R (A, B) representation relation table R, A, B are the attributes of R, and S (B, C) representation relation table S, B, C are the attributes of S;θ It is the contiguous function of R and S;QB represents one and is related to the inquiry of R and S, and " ∞ " represents bound symbol;Due in distributed environment Under, the computational format of data is key-value form, and therefore, field B regards that the key of R and S, field A regard relation table R as In all fields other than B combination, field C regards the combination of all fields in relation table S other than B as;From And the connection of multiword segment table is converted into the key-value form in distributed computing environment;
Step 2: assume that R (A, B) and S (B, C) are two tables to be connected, connection attribute is B, and contiguous function is θ, and θ ∈ { >, < ,≤, >=}, maximum value and minimum value in R.B record are R.Bmax and R.Bmin, the maximum value in S.B record respectively It is S.Bmax and S.Bmin respectively with minimum value, then obtains following theorem:
Theorem 1: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" > ", thenIf x > S.Bmin then retains x, otherwise filters out the x from the attribute B of relation table R;If y < R.Bmax retains y Otherwise the y is filtered out from the attribute B of relation table S;
Symbol description: the minimum value in all records of the attribute column B in S.Bmin representation relation table S, R.Bmax representation relation Maximum value in all records of attribute column B in table R;The arbitrary value of attribute column B in x representation relation table R, y representation relation The arbitrary value of attribute column B in table S;
Theorem 2: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" < ", thenIf x < S.Bmax then retains x, otherwise filters out the x from the attribute B of relation table R;If y > R.Bmin retains y Otherwise the y is filtered out from the attribute B of relation table S;
Symbol description: the maximum value in all records of the attribute column B in S.Bmax representation relation table S, R.Bmin representation relation Minimum value in all records of attribute column B in table R;
Theorem 3: for two specified table R (A, B) and S (B, C), connection attribute B, θ=" >=", thenIf X >=S.Bmin then retains x, otherwise filters out the x from the attribute B of relation table R;If y≤R.Bmax is protected Y is stayed otherwise to filter out the y from the attribute B of relation table S;
Theorem 4: for two specified table R (A, B) and S (B, C), connection attribute B, θ="≤", thenIf X≤S.Bmax then retains x, otherwise filters out the x from the attribute B of relation table R;If y >=R.Bmin is protected Y is stayed otherwise to filter out the y from the attribute B of relation table S;
Step 3: non-equivalent join algorithm:
Algorithm input: two tables of data R (A, B) to be connected and S (B, C), connection attribute B, contiguous function θ;
Algorithm output: meeting the result set T (R.A, R.B, S.B, S.C) of non-equivalent condition of contact, for any record in T, All meet condition R.B θ S.B;
Algorithm flow:
Step1: from connection column R.B is taken out in tables of data R (A, B), sorting to R.B, and traversal R.B record finds R.B record most Big value R.Bmax and minimum value R.Bmin;From connection column S.B is taken out in tables of data S (B, C), sort to S.B, traversal S.B record, Find the maximum value S.Bmax and minimum value S.Bmin of S.B record;
Step2: judging condition of contact θ, if θ=" > ", was carried out according to the filtering rule of theorem 1 to R (A, B) and S (B, C) Filter, obtains filtered data set R'(A, B) and S'(B, C);
Symbol description: R'(A, B) representation relation table R (A, B) filtered a subset, S'(B, C) representation relation table S (B, C) Filtered a subset;
If θ=" < ", R (A, B) and S (B, C) are filtered according to the filtering rule of theorem 2, obtain filtered data set R'(A, B) and S'(B, C);
If θ=" >=", R (A, B) and S (B, C) are filtered according to the filtering rule of theorem 3, obtain filtered data Collect R'(A, B) and S'(B, C);
If θ="≤", R (A, B) and S (B, C) are filtered according to the filtering rule of theorem 4, obtain filtered data Collect R'(A, B) and S'(B, C);
Step3: carrying out distributed computing for filtered R'(A, B) and S'(B, C), obtain cartesian product result set T'(R.A, R.B,S.B,S.C);
Symbol description: T'(R.A, R.B, S.B, S.C) represent to R'(A, B) and S'(B, C) carry out the knot after cartesian product calculating The all properties column of fruit collection, the cartesian product of the full attribute column carried out here, therefore R and S are included;
Step4: according to condition of contact θ, to T'(R.A, R.B, S.B, S.C) in R.B and S.B carry out postsearch screening, delete not Qualified record obtains final connection result collection T (R.A, R.B, S.B, S.C);
Symbol description: T (R.A, R.B, S.B, S.C) is indicated to cartesian product T'(R.A, R.B, S.B, S.C) carry out postsearch screening Result set afterwards, because filtering rule cannot be guaranteed to screen out all records for not meeting condition of contact, in order to guarantee result Correctness carries out postsearch screening here;
Step5: output non-equivalent result set T (R.A, R.B, S.B, S.C).
CN201610312145.4A 2016-05-12 2016-05-12 Non-equivalent connection method towards magnanimity distributed data Active CN106021386B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610312145.4A CN106021386B (en) 2016-05-12 2016-05-12 Non-equivalent connection method towards magnanimity distributed data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610312145.4A CN106021386B (en) 2016-05-12 2016-05-12 Non-equivalent connection method towards magnanimity distributed data

Publications (2)

Publication Number Publication Date
CN106021386A CN106021386A (en) 2016-10-12
CN106021386B true CN106021386B (en) 2019-02-05

Family

ID=57099280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610312145.4A Active CN106021386B (en) 2016-05-12 2016-05-12 Non-equivalent connection method towards magnanimity distributed data

Country Status (1)

Country Link
CN (1) CN106021386B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268586B (en) * 2017-09-22 2020-06-16 阿里巴巴(中国)有限公司 Data processing method, device, medium and computing equipment across multiple data tables
CN109710643B (en) * 2018-12-20 2020-11-13 上海达梦数据库有限公司 External connection management method, device, server and storage medium
CN110489452A (en) * 2019-08-21 2019-11-22 中国科学院深圳先进技术研究院 Multiplex data stream θ connection optimization method and system
CN112615910B (en) * 2020-12-10 2021-10-29 中国科学院深圳先进技术研究院 Data stream connection optimization method, system, terminal and storage medium
CN112948442B (en) * 2021-03-26 2022-06-21 深圳先进技术研究院 Data stream theta connection optimization method, system, terminal and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521405A (en) * 2011-12-26 2012-06-27 中国科学院计算技术研究所 Massive structured data storage and query methods and systems supporting high-speed loading

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521405A (en) * 2011-12-26 2012-06-27 中国科学院计算技术研究所 Massive structured data storage and query methods and systems supporting high-speed loading

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MapReduce连接查询的I/O代价研究;宋杰等;《软件学报》;20150630;全文
processing theta-join using mapreduce;Alper Okcan等;《International Conference on Management of data》;20110630;全文
基于MapReduce的大数据连接算法的设计与优化;张常淳;《中国博士论文全文数据库》;20141015;全文

Also Published As

Publication number Publication date
CN106021386A (en) 2016-10-12

Similar Documents

Publication Publication Date Title
CN106021386B (en) Non-equivalent connection method towards magnanimity distributed data
CN106452868B (en) A kind of network flow statistic implementation method for supporting various dimensions polymerization classification
CN105279276A (en) Database index optimization system
CN106254321A (en) A kind of whole network abnormal data stream sorting technique
CN109325062B (en) Data dependency mining method and system based on distributed computation
CN102323942A (en) Statistical query method
CN111125199B (en) Database access method and device and electronic equipment
Vu et al. Using deep learning for big spatial data partitioning
CN111126865A (en) Technology maturity judging method and system based on scientific and technological big data
US20190347302A1 (en) Device, system, and method for determining content relevance through ranked indexes
Ding et al. MO-Tree: an efficient forwarding engine for spatiotemporal-aware pub/sub systems
Doulkeridis et al. On saying" enough already!" in mapreduce
Nam et al. A parallel query processing system based on graph-based database partitioning
CN112329971A (en) Modeling method of investment decision model of power transmission and transformation project
CN111522819A (en) Method and system for summarizing tree-structured data
Gu et al. An improved apriori algorithm
Yu et al. Research on promoting factors of “internet plus innovation and entrepreneurship” based on the perspective of cloud analysis and fuzz analysis method
Zheng et al. User preference-based data partitioning top-k skyline query processing algorithm
US20140089311A1 (en) System. method, and computer-readable medium for classifying problem queries to reduce exception processing
CN104021169B (en) A kind of Hive Connection inquiring methods based on the algorithms of SDD 1
Ramana et al. Methods for mining cross level association rule in taxonomy data structures
Chatzigeorgakidis et al. Scalable hybrid similarity join over geolocated time series
Liu et al. Rare itemsets mining algorithm based on RP-Tree and spark framework
CN109447490A (en) A kind of family change relationship anomalous discrimination method based on station address
KR20110080966A (en) An associative classification method for detecting useful knowledge from huge multi-attributes dataset

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Liu Wenjie

Inventor after: Li Zhanhuai

Inventor after: Pan Wei

Inventor after: Zhang Xiao

Inventor before: Liu Wenjie

Inventor before: Li Zhanhuai

Inventor before: Pan Wei

Inventor before: Zhang Xiao

COR Change of bibliographic data
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210913

Address after: 310000 room 660, building 5, No. 16, Zhuantang science and technology economic block, Xihu District, Hangzhou City, Zhejiang Province

Patentee after: Yunyao Technology (Zhejiang) Co.,Ltd.

Address before: 710072 No. 127 Youyi West Road, Shaanxi, Xi'an

Patentee before: Northwestern Polytechnical University

TR01 Transfer of patent right