CN106021386A - Theta-join method for massive distributed data - Google Patents

Info

Publication number
CN106021386A
Authority
CN
China
Prior art keywords
attribute
relation table
bmax
bmin
theorem
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610312145.4A
Other languages
Chinese (zh)
Other versions
CN106021386B (en)
Inventor
刘文洁
李占怀
潘巍
张晓�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunyao Technology (Zhejiang) Co.,Ltd.
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN201610312145.4A priority Critical patent/CN106021386B/en
Publication of CN106021386A publication Critical patent/CN106021386A/en
Application granted granted Critical
Publication of CN106021386B publication Critical patent/CN106021386B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases

Abstract

The invention discloses a theta-join method for massive distributed data, which is used for solving the technical problem of low efficiency of the conventional theta-join method. The method adopts the technical scheme that prior to theta-join of two tables, appropriate filtering rules are firstly selected according to join conditions, the maximum values and the minimum values of join fields of the two tables are then calculated, all records in the two tables are scanned according to the maximum values and the minimum values, the records irrelevant to output results are removed through filtering, Cartesian product calculation is only carried out for filtered data, and a secondary comparison of Cartesian product results is finally carried out according to the join conditions, so that the records satisfying the join conditions are obtained through screening. The method has the advantages that a large number of the records which fail to satisfy the join conditions are removed through the filtering prior to the Cartesian product calculation, so that the workload of a Reducer is effectively reduced, and the theta-join query efficiency is improved.

Description

Theta-join method for massive distributed data
Technical field
The present invention relates to a theta-join (non-equi-join) method, and in particular to a theta-join method for massive distributed data.
Background art
In cloud computing environments, the explosive growth of data volume poses new challenges for data storage, processing, and analysis. Traditional databases and data processing methods cannot meet the storage and processing demands of big data. At present, the mainstream approach is to use MapReduce parallel processing to increase processing speed. Under the MapReduce-based parallel distributed model, data can be partitioned and processed in a distributed manner. However, join operations, and theta-joins in particular, produce Cartesian-product results that sharply increase the volume of data on the network and on disk. This causes heavy I/O and disk overhead, so processing efficiency is very low.
The document "Alper Okcan. Processing theta-joins using mapreduce [C]. Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, P946-960, ACM, 2011" discloses an effective method for processing a theta-join of two tables in a distributed environment, called 1-bucket-theta. The method uses a join matrix (join-matrix) as the theta-join model: the Cartesian product of the two tables is represented as a join matrix, and a randomized algorithm spreads the cells of the matrix evenly across the nodes of the cluster, so that the workload of each node is roughly equal. This ensures load balancing among all nodes during parallel computation and improves join efficiency. However, the Cartesian product is the universal set of the join: for a theta-join, not every cell contributes to the final output. For queries with low selectivity, many nodes therefore do useless work that yields no output, so efficiency is very low.
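The selectivity argument above can be made concrete with a minimal sketch. It is not the 1-bucket-theta algorithm itself; it only counts how many cells of a join matrix satisfy the join condition. The function name `join_matrix_selectivity` and the sample key columns are hypothetical, chosen for illustration.

```python
import operator
from itertools import product


def join_matrix_selectivity(r_keys, s_keys, theta):
    """Return the fraction of join-matrix cells (r, s) for which theta(r, s) holds."""
    cells = list(product(r_keys, s_keys))
    if not cells:
        return 0.0
    hits = sum(1 for r, s in cells if theta(r, s))
    return hits / len(cells)


# Hypothetical join-key columns (assumption, not taken from the cited document).
r_keys = [1, 2, 3, 4, 5]
s_keys = [3, 5, 5]

# Join condition R.B > S.B: only the cells (4, 3) and (5, 3) produce output.
selectivity = join_matrix_selectivity(r_keys, s_keys, operator.gt)
```

With these assumed keys only 2 of the 15 cells produce output, so any scheme that distributes all 15 cells spends most of its work on cells that yield nothing.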
Summary of the invention
To overcome the low efficiency of existing theta-join methods, the present invention provides a theta-join method for massive distributed data. Before performing the theta-join of two tables, the method first selects an appropriate filtering rule according to the join condition, then computes the maximum and minimum values of the join field in each table. It scans all records of both tables against these extrema and filters out records that cannot contribute to the output. The Cartesian product is computed only over the filtered data, and finally the Cartesian-product result is compared a second time against the join condition to select the records that satisfy it. Because a large number of records that do not satisfy the join condition are removed before the Cartesian product is computed, the workload of the Reducer is effectively reduced and theta-join query efficiency is improved.
The technical solution adopted by the present invention to solve this technical problem is a theta-join method for massive distributed data, characterized by comprising the following steps:
Step one. Assume two relation tables R(A, B) and S(B, C) and a join function θ ∈ {>, <, ≥, ≤}. The query $Q_B$ is defined as $R(A,B) \bowtie_{R.B\,\theta\,S.B} S(B,C)$; $Q_B$ is then called a theta-join query between relation tables R and S on field B.
Notation: R(A, B) denotes relation table R, whose attributes are A and B; S(B, C) denotes relation table S, whose attributes are B and C. θ is the join function of R and S. $Q_B$ denotes a query involving R and S, and $\bowtie$ denotes the join operator. Because data in a distributed environment is processed in key-value form, field B is treated as the key of R and S, field A as the combination of all fields of relation table R other than B, and field C as the combination of all fields of relation table S other than B. In this way, a join of multi-field tables is converted to the key-value form used in distributed computing environments.
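As a small illustration of the key-value view described above, the following sketch splits a multi-field record into a key (the join field B) and a value (all remaining fields). The dictionary-based record layout, the field names `A1` and `A2`, and the helper `to_key_value` are assumptions made for this example only.

```python
def to_key_value(row, join_field):
    """Split a record into (key, value): key is the join field,
    value is a tuple of all remaining fields in their original order."""
    key = row[join_field]
    value = tuple(v for name, v in row.items() if name != join_field)
    return key, value


# Hypothetical record of R with two non-join fields (together playing the role of A).
r_row = {"A1": "x", "A2": 7, "B": 4}
r_key, r_value = to_key_value(r_row, "B")
```

Here `r_key` is the join key and `r_value` stands for the combined field A, so a table with any number of columns fits the R(A, B) form.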
Step two. Assume R(A, B) and S(B, C) are the two tables to be joined, the join attribute is B, and the join function is θ, with θ ∈ {>, <, ≤, ≥}. Let R.Bmax and R.Bmin be the maximum and minimum values among the R.B records, and S.Bmax and S.Bmin the maximum and minimum values among the S.B records. The following theorems then hold:
Theorem 1: For two given tables R(A, B) and S(B, C) with join attribute B and θ = ">": for every x in attribute B of relation table R, if x > S.Bmin, retain x, otherwise filter out x; for every y in attribute B of relation table S, if y < R.Bmax, retain y, otherwise filter out y.
Notation: S.Bmin denotes the minimum over all records of attribute column B in relation table S, and R.Bmax denotes the maximum over all records of attribute column B in relation table R. x denotes any value of attribute column B in relation table R, and y denotes any value of attribute column B in relation table S.
Theorem 2: For two given tables R(A, B) and S(B, C) with join attribute B and θ = "<": for every x in attribute B of relation table R, if x < S.Bmax, retain x, otherwise filter out x; for every y in attribute B of relation table S, if y > R.Bmin, retain y, otherwise filter out y.
Notation: S.Bmax denotes the maximum over all records of attribute column B in relation table S, and R.Bmin denotes the minimum over all records of attribute column B in relation table R.
Theorem 3: For two given tables R(A, B) and S(B, C) with join attribute B and θ = "≥": for every x in attribute B of relation table R, if x ≥ S.Bmin, retain x, otherwise filter out x; for every y in attribute B of relation table S, if y ≤ R.Bmax, retain y, otherwise filter out y.
Theorem 4: For two given tables R(A, B) and S(B, C) with join attribute B and θ = "≤": for every x in attribute B of relation table R, if x ≤ S.Bmax, retain x, otherwise filter out x; for every y in attribute B of relation table S, if y ≥ R.Bmin, retain y, otherwise filter out y.
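The four theorems can be read as a lookup from the join operator to a pair of retain-predicates, one for values x of R.B and one for values y of S.B. The sketch below is one possible encoding; the function name `make_filters` and the operator strings are assumptions for illustration.

```python
def make_filters(theta, r_min, r_max, s_min, s_max):
    """Return (keep_x, keep_y): the retain-predicates of Theorems 1-4
    for the join operator theta, given the extrema of R.B and S.B."""
    rules = {
        ">":  (lambda x: x > s_min,  lambda y: y < r_max),   # Theorem 1
        "<":  (lambda x: x < s_max,  lambda y: y > r_min),   # Theorem 2
        ">=": (lambda x: x >= s_min, lambda y: y <= r_max),  # Theorem 3
        "<=": (lambda x: x <= s_max, lambda y: y >= r_min),  # Theorem 4
    }
    return rules[theta]


# Extrema matching the example values used in this document:
# R.Bmin = 1, R.Bmax = 5, S.Bmin = 3, S.Bmax = 5.
keep_x, keep_y = make_filters(">", r_min=1, r_max=5, s_min=3, s_max=5)
```

Each predicate depends only on two precomputed extrema, so it can be evaluated on every record independently of all other records.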
Step three. Theta-join algorithm:
Algorithm input: the two data tables to be joined, R(A, B) and S(B, C), the join attribute B, and the join function θ.
Algorithm output: the result set T(R.A, R.B, S.B, S.C) that satisfies the theta-join condition; every record in T satisfies R.B θ S.B.
Algorithm flow:
Step 1: Extract the join column R.B from data table R(A, B), sort R.B, traverse it, and find the maximum R.Bmax and minimum R.Bmin of the record set. Extract the join column S.B from data table S(B, C), sort S.B, traverse it, and find the maximum S.Bmax and minimum S.Bmin of the record set;
Step 2: Check the join condition θ. If θ = ">", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 1 to obtain the filtered data sets R'(A, B) and S'(B, C);
Notation: R'(A, B) denotes the subset of relation table R(A, B) that remains after filtering, and S'(B, C) denotes the subset of relation table S(B, C) that remains after filtering.
If θ = "<", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 2 to obtain the filtered data sets R'(A, B) and S'(B, C);
If θ = "≥", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 3 to obtain the filtered data sets R'(A, B) and S'(B, C);
If θ = "≤", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 4 to obtain the filtered data sets R'(A, B) and S'(B, C);
Step 3: Perform a distributed computation over the filtered R'(A, B) and S'(B, C) to obtain the Cartesian-product result set T'(R.A, R.B, S.B, S.C);
Notation: T'(R.A, R.B, S.B, S.C) denotes the result set of the Cartesian product of R'(A, B) and S'(B, C). The Cartesian product is taken over all attribute columns, so all attribute columns of R and S are included.
Step 4: According to the join condition θ, apply a secondary screening to R.B and S.B in T'(R.A, R.B, S.B, S.C), delete the records that do not satisfy the condition, and obtain the final join result set T(R.A, R.B, S.B, S.C);
Notation: T(R.A, R.B, S.B, S.C) denotes the result set after secondary screening of the Cartesian product T'(R.A, R.B, S.B, S.C). The filtering rules cannot guarantee that every record violating the join condition has been removed, so the secondary screening is performed here to ensure a correct result.
Step 5: Output the theta-join result set T(R.A, R.B, S.B, S.C).
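The five steps can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not the distributed implementation: tables are plain Python lists of tuples, the name `theta_join` and the sample rows are hypothetical, and `min`/`max` stand in for the sort-and-traverse of Step 1.

```python
import operator
from itertools import product

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def theta_join(R, S, theta):
    """R is a list of (A, B) rows, S a list of (B, C) rows.
    Returns rows (R.A, R.B, S.B, S.C) with R.B theta S.B."""
    if not R or not S:
        return []
    cmp = OPS[theta]
    # Step 1: extrema of the join columns.
    r_b = [b for _, b in R]
    s_b = [b for b, _ in S]
    r_min, r_max = min(r_b), max(r_b)
    s_min, s_max = min(s_b), max(s_b)
    # Step 2: pre-filter with the rule of Theorems 1-4.
    if theta in (">", ">="):
        r_f = [(a, b) for a, b in R if cmp(b, s_min)]   # x theta S.Bmin
        s_f = [(b, c) for b, c in S if cmp(r_max, b)]   # R.Bmax theta y
    else:
        r_f = [(a, b) for a, b in R if cmp(b, s_max)]   # x theta S.Bmax
        s_f = [(b, c) for b, c in S if cmp(r_min, b)]   # R.Bmin theta y
    # Step 3: Cartesian product of the filtered tables.
    t_prime = [(a, rb, sb, c) for (a, rb), (sb, c) in product(r_f, s_f)]
    # Step 4: secondary screening against the join condition.
    # Step 5: return the final result set.
    return [row for row in t_prime if cmp(row[1], row[2])]


# Hypothetical rows, chosen to be consistent with the extrema used in this document.
R = [("a1", 1), ("a2", 2), ("a3", 3), ("a4", 4), ("a5", 5)]
S = [(3, "c1"), (5, "c2"), (5, "c3")]
T = theta_join(R, S, ">")
```

Writing the S-side filter as `cmp(r_max, b)` lets one comparison operator serve both sides: for ">" it reads R.Bmax > y, which is the same as y < R.Bmax in Theorem 1.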
The beneficial effects of the invention are as follows. Before performing the theta-join of two tables, the method first selects an appropriate filtering rule according to the join condition, then computes the maximum and minimum values of the join field in each table, scans all records of both tables against these extrema, and filters out records that cannot contribute to the output. The Cartesian product is computed only over the filtered data, and the Cartesian-product result is finally compared a second time against the join condition to select the records that satisfy it. Because a large number of records that do not satisfy the join condition are removed before the Cartesian product is computed, the workload of the Reducer is significantly reduced and theta-join query efficiency is improved.
Refer to Fig. 1. The beneficial effects of the present invention are illustrated below with an example.
Assume two data tables R(A, B) and S(B, C) with join attribute B and join condition ">". The SQL statement of the query is:
Select R.B,S.B
From R,S
Where R.B>S.B;
R has 5 records and S has 3 records, so the Cartesian product of the original R and S contains 5*3 = 15 records. First find the maximum and minimum of R.B, which are 5 and 1, and the maximum and minimum of S.B, which are 5 and 3. Since the join function θ is ">", Theorem 1 applies: for any record x in R.B, test x > S.Bmin, i.e. x > 3, and retain all records that satisfy this condition, which gives the subset R'.B = {4, 5}. Likewise, for any record y in S.B, test y < R.Bmax, i.e. y < 5, which gives the subset S'.B = {3}. The Cartesian product of R'.B and S'.B contains 2*1 = 2 records, {(4,3), (5,3)}. After secondary screening, both records satisfy the join condition, and this result set is output. With the method of the invention, the cost of the Cartesian product drops from 15 records to 2, an improvement of (15-2)/15 ≈ 87%.
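The arithmetic of this example can be reproduced with a short sketch. The text gives only the record counts, the extrema, and the filtered subsets, so the concrete column values below are assumptions chosen to be consistent with them (R.B with 5 values between 1 and 5, S.B with 3 values between 3 and 5).

```python
# Assumed column contents (hypothetical; only counts and extrema come from the text).
r_b = [1, 2, 3, 4, 5]
s_b = [3, 5, 5]

full_cost = len(r_b) * len(s_b)                    # Cartesian product before filtering

r_filtered = [x for x in r_b if x > min(s_b)]      # Theorem 1: keep x > S.Bmin
s_filtered = [y for y in s_b if y < max(r_b)]      # Theorem 1: keep y < R.Bmax
filtered_cost = len(r_filtered) * len(s_filtered)  # Cartesian product after filtering

result = [(x, y) for x in r_filtered for y in s_filtered if x > y]
reduction = (full_cost - filtered_cost) / full_cost
```

Under these assumed values the product shrinks from 15 pairs to 2, and `reduction` equals 13/15.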
In a real distributed cluster of 16 nodes, data sets of 1G-100G were generated with TPC-H, and the method of the present invention was compared with the method of the cited document on various theta-join queries. The method of the invention clearly outperforms the method of the document, speeding up queries by about 40-60 times.
The present invention is described in detail below with reference to the accompanying drawing and a specific embodiment.
Brief description of the drawings
Fig. 1 is an example diagram of a theta-join query performed in a distributed environment using the theta-join method for massive distributed data of the present invention. The tables to be queried are R(A, B) and S(B, C), the join attribute column is B, and the join condition is R.B > S.B. The expected output records are (4,3) and (5,3). In table R, the maximum of attribute B is 5 and the minimum is 1. In table S, the maximum of attribute B is 5 and the minimum is 3.
Detailed description of the embodiments
The specific steps of the theta-join method for massive distributed data of the present invention are as follows:
A. Theta-join query definition:
Assume two relation tables R(A, B) and S(B, C) and a join function θ ∈ {>, <, ≥, ≤}. The query $Q_B$ is defined as $R(A,B) \bowtie_{R.B\,\theta\,S.B} S(B,C)$; $Q_B$ is then called a theta-join query between relation tables R and S on field B.
Notation: R(A, B) denotes relation table R, whose attributes are A and B; S(B, C) denotes relation table S, whose attributes are B and C. θ is the join function of R and S. $Q_B$ denotes a query involving R and S, and $\bowtie$ denotes the join operator. Because data in a distributed environment is processed in key-value form, field B can be treated as the key of R and S, field A as the combination (value) of all fields of relation table R other than B, and field C as the combination (value) of all fields of relation table S other than B. In this way, a join of multi-field tables can be converted to the key-value form used in distributed computing environments.
For example, the relation table R(A, B) to be queried contains attributes A and B, and the relation table S(B, C) contains attributes B and C. The join function is θ = {>} and the theta-join column is B. The query $Q_B$ is defined as $R(A,B) \bowtie_{R.B > S.B} S(B,C)$. The query is required to output only the join columns of the two tables (R.B, S.B).
B. Filtering rules based on maximum and minimum values:
To illustrate the data filtering method based on maximum and minimum values, we describe the idea of the method with an example of a theta-join query in a distributed environment.
Assume R(A, B) and S(B, C) are the two tables to be joined, the join attribute is B, and the join function is θ, with θ ∈ {>, <, ≤, ≥}. Let R.Bmax and R.Bmin be the maximum and minimum values among the R.B records, and S.Bmax and S.Bmin the maximum and minimum values among the S.B records. We then obtain the following theorems:
Theorem 1: For two given tables R(A, B) and S(B, C) with join attribute B and θ = ">": for every x in attribute B of relation table R, if x > S.Bmin, retain x, otherwise filter out x; for every y in attribute B of relation table S, if y < R.Bmax, retain y, otherwise filter out y.
Notation: S.Bmin denotes the minimum over all records of attribute column B in relation table S, and R.Bmax denotes the maximum over all records of attribute column B in relation table R. x denotes any value of attribute column B in relation table R, and y denotes any value of attribute column B in relation table S.
Continuing the example: since the join function is θ = {>}, Theorem 1 is selected as the filtering rule.
Theorem 2: For two given tables R(A, B) and S(B, C) with join attribute B and θ = "<": for every x in attribute B of relation table R, if x < S.Bmax, retain x, otherwise filter out x; for every y in attribute B of relation table S, if y > R.Bmin, retain y, otherwise filter out y.
Notation: S.Bmax denotes the maximum over all records of attribute column B in relation table S, and R.Bmin denotes the minimum over all records of attribute column B in relation table R.
Theorem 3: For two given tables R(A, B) and S(B, C) with join attribute B and θ = "≥": for every x in attribute B of relation table R, if x ≥ S.Bmin, retain x, otherwise filter out x; for every y in attribute B of relation table S, if y ≤ R.Bmax, retain y, otherwise filter out y.
Theorem 4: For two given tables R(A, B) and S(B, C) with join attribute B and θ = "≤": for every x in attribute B of relation table R, if x ≤ S.Bmax, retain x, otherwise filter out x; for every y in attribute B of relation table S, if y ≥ R.Bmin, retain y, otherwise filter out y.
Theorems 1 to 4 are the filtering rules applied before the two tables are joined; which one applies depends on the specific join function θ. To demonstrate the correctness of the theorems, we prove Theorem 1; the proofs of Theorems 2-4 are analogous.
Proof (Theorem 1): Suppose x ∈ R.B and x ≤ S.Bmin. Then for every y ∈ S.B we have y ≥ S.Bmin ≥ x, i.e. x ≤ y, which contradicts the join condition ">" (x > y). Such an x cannot produce any result record that satisfies the join condition, so it should be filtered out. Similarly, suppose y ∈ S.B and y ≥ R.Bmax. Then for every x ∈ R.B we have y ≥ R.Bmax ≥ x (i.e. x ≤ y), which also contradicts the join condition ">" (x > y), so such a y should also be filtered out.
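The proof shows that the filters never discard a value that takes part in a result. As an informal complement, the sketch below checks this empirically for all four operators on random integer columns by comparing a pre-filtered join with a naive join. It is a sanity check under assumed test data, not a proof; the function names and the random ranges are hypothetical.

```python
import operator
import random
from itertools import product

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def naive_pairs(r_b, s_b, theta):
    """All (x, y) with x theta y, taken from the full Cartesian product."""
    cmp = OPS[theta]
    return sorted((x, y) for x, y in product(r_b, s_b) if cmp(x, y))


def prefiltered_pairs(r_b, s_b, theta):
    """Same join, but with the min/max pre-filter of Theorems 1-4 applied first."""
    cmp = OPS[theta]
    x_bound = min(s_b) if theta in (">", ">=") else max(s_b)
    y_bound = max(r_b) if theta in (">", ">=") else min(r_b)
    r_f = [x for x in r_b if cmp(x, x_bound)]
    s_f = [y for y in s_b if cmp(y_bound, y)]
    return sorted((x, y) for x, y in product(r_f, s_f) if cmp(x, y))


def check_theorems(trials=200, seed=0):
    """Return True if both joins agree on every random trial and operator."""
    rng = random.Random(seed)
    for _ in range(trials):
        r_b = [rng.randint(0, 20) for _ in range(rng.randint(1, 8))]
        s_b = [rng.randint(0, 20) for _ in range(rng.randint(1, 8))]
        for theta in OPS:
            if naive_pairs(r_b, s_b, theta) != prefiltered_pairs(r_b, s_b, theta):
                return False
    return True
```

Calling `check_theorems()` returns `True` when the pre-filter preserves every result pair in all trials, which is what the theorems predict.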
C. Theta-join algorithm:
Algorithm input: the two data tables to be joined, R(A, B) and S(B, C), the join attribute B, and the join function θ.
Algorithm output: the result set T(R.A, R.B, S.B, S.C) that satisfies the theta-join condition; every record in T satisfies R.B θ S.B (for example, R.B > S.B).
Algorithm flow:
Step 1: Extract the join column R.B from data table R(A, B), sort R.B, traverse it, and find the maximum R.Bmax = 5 and minimum R.Bmin = 1 of the record set. Extract the join column S.B from data table S(B, C), sort S.B, traverse it, and find the maximum S.Bmax = 5 and minimum S.Bmin = 3 of the record set;
Step 2: Because θ = ">", filter R.B and S.B according to the filtering rule of Theorem 1 (the query outputs only column B, so only B is processed, which avoids computation on unrelated columns). The filter conditions are R.B > 3 (S.Bmin) and S.B < 5 (R.Bmax), which give the filtered data sets R'.B = {4, 5} and S'.B = {3};
Notation: R'(A, B) denotes the subset of relation table R(A, B) that remains after filtering, and S'(B, C) denotes the subset of relation table S(B, C) that remains after filtering.
If θ = "<", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 2 to obtain the filtered data sets R'(A, B) and S'(B, C);
If θ = "≥", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 3 to obtain the filtered data sets R'(A, B) and S'(B, C);
If θ = "≤", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 4 to obtain the filtered data sets R'(A, B) and S'(B, C);
Step 3: Perform a distributed computation over the filtered R'.B and S'.B to obtain the Cartesian-product result set T'(R.B, S.B) = {(4,3), (5,3)};
Notation: T'(R.A, R.B, S.B, S.C) denotes the result set of the Cartesian product of R'(A, B) and S'(B, C). In the general case the Cartesian product is taken over all attribute columns, so all attribute columns of R and S are included.
Step 4: According to the join condition θ, apply a secondary screening to R.B and S.B in T'(R.A, R.B, S.B, S.C), delete the records that do not satisfy the condition, and obtain the final join result set T(R.A, R.B, S.B, S.C). In the example, each record in T'(R.B, S.B) = {(4,3), (5,3)} is checked against the condition R.B > S.B, which gives the final result set T(R.B, S.B) = {(4,3), (5,3)};
Notation: T(R.A, R.B, S.B, S.C) denotes the result set after secondary screening of the Cartesian product T'(R.A, R.B, S.B, S.C). The filtering rules cannot guarantee that every record violating the join condition has been removed, so the secondary screening is performed here to ensure a correct result.
Step 5: Output the theta-join result set T(R.B, S.B) = {(4,3), (5,3)}.
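The example above can also be sketched in a map/reduce style. The following is a plain-Python simulation, not Hadoop code, and it is only one possible way to arrange the steps: the partitioning of the columns, the tagging scheme, and the function names `global_extrema`, `map_filter`, and `reduce_join` are all assumptions, and the sketch is specialised to θ = ">".

```python
from functools import reduce
from itertools import product


def global_extrema(partitions):
    """Combine per-partition (min, max) pairs into a global (min, max)."""
    local = [(min(p), max(p)) for p in partitions if p]
    return reduce(lambda a, b: (min(a[0], b[0]), max(a[1], b[1])), local)


def map_filter(partition, table, r_max, s_min):
    """Map phase for theta = '>': emit (table, key) only for keys passing Theorem 1."""
    for key in partition:
        if table == "R" and key > s_min:
            yield ("R", key)
        elif table == "S" and key < r_max:
            yield ("S", key)


def reduce_join(tagged):
    """Reduce phase: Cartesian product of surviving keys, then secondary screening."""
    r_keys = [k for t, k in tagged if t == "R"]
    s_keys = [k for t, k in tagged if t == "S"]
    return [(x, y) for x, y in product(r_keys, s_keys) if x > y]


# Hypothetical partitioning of the join columns across nodes.
r_parts = [[1, 2, 3], [4, 5]]
s_parts = [[3, 5], [5]]

r_min, r_max = global_extrema(r_parts)   # Step 1 for R.B
s_min, s_max = global_extrema(s_parts)   # Step 1 for S.B

tagged = [rec for p in r_parts for rec in map_filter(p, "R", r_max, s_min)]
tagged += [rec for p in s_parts for rec in map_filter(p, "S", r_max, s_min)]
result = reduce_join(tagged)             # Steps 3-5
```

In this arrangement the filter runs where the data resides, so only three of the eight input keys reach the reduce phase.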

Claims (1)

1. A theta-join method for massive distributed data, characterized by comprising the following steps:
Step one. Assume two relation tables R(A, B) and S(B, C) and a join function θ ∈ {>, <, ≥, ≤}; the query $Q_B$ is defined as $R(A,B) \bowtie_{R.B\,\theta\,S.B} S(B,C)$, and $Q_B$ is then called a theta-join query between relation tables R and S on field B;
Notation: R(A, B) denotes relation table R, whose attributes are A and B; S(B, C) denotes relation table S, whose attributes are B and C; θ is the join function of R and S; $Q_B$ denotes a query involving R and S, and $\bowtie$ denotes the join operator; because data in a distributed environment is processed in key-value form, field B is treated as the key of R and S, field A as the combination of all fields of relation table R other than B, and field C as the combination of all fields of relation table S other than B; in this way, a join of multi-field tables is converted to the key-value form used in distributed computing environments;
Step two. Assume R(A, B) and S(B, C) are the two tables to be joined, the join attribute is B, and the join function is θ, with θ ∈ {>, <, ≤, ≥}; let R.Bmax and R.Bmin be the maximum and minimum values among the R.B records, and S.Bmax and S.Bmin the maximum and minimum values among the S.B records; the following theorems then hold:
Theorem 1: For two given tables R(A, B) and S(B, C) with join attribute B and θ = ">": for every x in attribute B of relation table R, if x > S.Bmin, retain x, otherwise filter out x; for every y in attribute B of relation table S, if y < R.Bmax, retain y, otherwise filter out y;
Notation: S.Bmin denotes the minimum over all records of attribute column B in relation table S, and R.Bmax denotes the maximum over all records of attribute column B in relation table R; x denotes any value of attribute column B in relation table R, and y denotes any value of attribute column B in relation table S;
Theorem 2: For two given tables R(A, B) and S(B, C) with join attribute B and θ = "<": for every x in attribute B of relation table R, if x < S.Bmax, retain x, otherwise filter out x; for every y in attribute B of relation table S, if y > R.Bmin, retain y, otherwise filter out y;
Notation: S.Bmax denotes the maximum over all records of attribute column B in relation table S, and R.Bmin denotes the minimum over all records of attribute column B in relation table R;
Theorem 3: For two given tables R(A, B) and S(B, C) with join attribute B and θ = "≥": for every x in attribute B of relation table R, if x ≥ S.Bmin, retain x, otherwise filter out x; for every y in attribute B of relation table S, if y ≤ R.Bmax, retain y, otherwise filter out y;
Theorem 4: For two given tables R(A, B) and S(B, C) with join attribute B and θ = "≤": for every x in attribute B of relation table R, if x ≤ S.Bmax, retain x, otherwise filter out x; for every y in attribute B of relation table S, if y ≥ R.Bmin, retain y, otherwise filter out y;
Step three. Theta-join algorithm:
Algorithm input: the two data tables to be joined, R(A, B) and S(B, C), the join attribute B, and the join function θ;
Algorithm output: the result set T(R.A, R.B, S.B, S.C) that satisfies the theta-join condition; every record in T satisfies R.B θ S.B;
Algorithm flow:
Step 1: Extract the join column R.B from data table R(A, B), sort R.B, traverse it, and find the maximum R.Bmax and minimum R.Bmin of the record set; extract the join column S.B from data table S(B, C), sort S.B, traverse it, and find the maximum S.Bmax and minimum S.Bmin of the record set;
Step 2: Check the join condition θ; if θ = ">", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 1 to obtain the filtered data sets R'(A, B) and S'(B, C);
Notation: R'(A, B) denotes the subset of relation table R(A, B) that remains after filtering, and S'(B, C) denotes the subset of relation table S(B, C) that remains after filtering;
If θ = "<", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 2 to obtain the filtered data sets R'(A, B) and S'(B, C);
If θ = "≥", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 3 to obtain the filtered data sets R'(A, B) and S'(B, C);
If θ = "≤", filter R(A, B) and S(B, C) according to the filtering rule of Theorem 4 to obtain the filtered data sets R'(A, B) and S'(B, C);
Step 3: Perform a distributed computation over the filtered R'(A, B) and S'(B, C) to obtain the Cartesian-product result set T'(R.A, R.B, S.B, S.C);
Notation: T'(R.A, R.B, S.B, S.C) denotes the result set of the Cartesian product of R'(A, B) and S'(B, C); the Cartesian product is taken over all attribute columns, so all attribute columns of R and S are included;
Step 4: According to the join condition θ, apply a secondary screening to R.B and S.B in T'(R.A, R.B, S.B, S.C), delete the records that do not satisfy the condition, and obtain the final join result set T(R.A, R.B, S.B, S.C);
Notation: T(R.A, R.B, S.B, S.C) denotes the result set after secondary screening of the Cartesian product T'(R.A, R.B, S.B, S.C); the filtering rules cannot guarantee that every record violating the join condition has been removed, so the secondary screening is performed here to ensure a correct result;
Step 5: Output the theta-join result set T(R.A, R.B, S.B, S.C).
CN201610312145.4A 2016-05-12 2016-05-12 Theta-join method for massive distributed data Active CN106021386B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610312145.4A CN106021386B (en) 2016-05-12 2016-05-12 Non-equivalent connection method towards magnanimity distributed data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610312145.4A CN106021386B (en) 2016-05-12 2016-05-12 Non-equivalent connection method towards magnanimity distributed data

Publications (2)

Publication Number Publication Date
CN106021386A true CN106021386A (en) 2016-10-12
CN106021386B CN106021386B (en) 2019-02-05

Family

ID=57099280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610312145.4A Active CN106021386B (en) 2016-05-12 2016-05-12 Theta-join method for massive distributed data

Country Status (1)

Country Link
CN (1) CN106021386B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521405A (en) * 2011-12-26 2012-06-27 中国科学院计算技术研究所 Massive structured data storage and query methods and systems supporting high-speed loading

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ALPER OKCAN等: "processing theta-join using mapreduce", 《INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA》 *
宋杰等: "MapReduce连接查询的I/O代价研究", 《软件学报》 *
张常淳: "基于MapReduce的大数据连接算法的设计与优化", 《中国博士论文全文数据库》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019056964A1 (en) * 2017-09-22 2019-03-28 广东神马搜索科技有限公司 Cross-multiple-data table data processing method, device, medium and computing apparatus
CN109710643A (en) * 2018-12-20 2019-05-03 上海达梦数据库有限公司 Outer join management method, apparatus, server and storage medium
CN110489452A (en) * 2019-08-21 2019-11-22 中国科学院深圳先进技术研究院 Multiplex data stream θ connection optimization method and system
WO2022121154A1 (en) * 2020-12-10 2022-06-16 中国科学院深圳先进技术研究院 Data stream connection optimization method, system, terminal, and storage medium
CN112948442A (en) * 2021-03-26 2021-06-11 深圳先进技术研究院 Data stream theta connection optimization method, system, terminal and storage medium
CN112948442B (en) * 2021-03-26 2022-06-21 深圳先进技术研究院 Data stream theta connection optimization method, system, terminal and storage medium

Also Published As

Publication number Publication date
CN106021386B (en) 2019-02-05

Similar Documents

Publication Publication Date Title
CN106021386A (en) Theta-join method for massive distributed data
US9454574B2 (en) Bloom filter costing estimation
US9535956B2 (en) Efficient set operation execution using a single group-by operation
Zhang et al. A highly optimized algorithm for continuous intersection join queries over moving objects
CN110471916A (en) Database query method, device, server and medium
Stocker et al. Integrating semi-join-reducers into state-of-the-art query processors
CN110222029A (en) Method and system for improving the computational efficiency of big data multidimensional analysis
CN110275929B (en) Candidate road section screening method based on grid segmentation and grid segmentation method
CN106254321A (en) Network-wide abnormal data stream classification method
CN102323942B (en) Statistical query method
CN104504154A (en) Method and device for data aggregate query
CN110909111A (en) Distributed storage and indexing method based on knowledge graph RDF data characteristics
CN104834754A (en) SPARQL semantic data query optimization method based on connection cost
CN103902742A (en) Access control determination engine optimization system and method based on big data
WO2013078478A1 (en) Improved database query optimization and cost estimation
CN108520035A (en) SPARQL basic graph pattern query processing method based on star decomposition
Silva et al. Database similarity join for metric spaces
US20190347302A1 (en) Device, system, and method for determining content relevance through ranked indexes
CN109783696A (en) A kind of multi-mode index of the picture construction method and system towards weak structure correlation
CN111125199B (en) Database access method and device and electronic equipment
CN109062949A (en) Method for improving multi-table join query efficiency in online aggregation
Zheng et al. User preference-based data partitioning top-k skyline query processing algorithm
Ramana et al. Methods for mining cross level association rule in taxonomy data structures
CN104021169B (en) Hive join query method based on the SDD-1 algorithm
CN107133281A (en) Grouping-based global multi-query optimization method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Liu Wenjie

Inventor after: Li Zhanhuai

Inventor after: Pan Wei

Inventor after: Zhang Xiao

Inventor before: Liu Wenjie

Inventor before: Li Zhanhuai

Inventor before: Pan Wei

Inventor before: Zhang Xiao

COR Change of bibliographic data
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210913

Address after: 310000 room 660, building 5, No. 16, Zhuantang science and technology economic block, Xihu District, Hangzhou City, Zhejiang Province

Patentee after: Yunyao Technology (Zhejiang) Co.,Ltd.

Address before: 710072 No. 127 Youyi West Road, Xi'an, Shaanxi

Patentee before: Northwestern Polytechnical University
