CN115858523A - Hash Join execution method for detecting tilt data - Google Patents
- Publication number
- CN115858523A (application CN202211418045.1A)
- Authority
- CN
- China
- Prior art keywords
- data set
- tilt
- data
- hash
- join
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a Hash Join execution method for detecting skewed (tilt) data, which relates to the technical field of distributed databases and comprises the following steps: acquiring two input data sets; setting a relative skew rate and multiplying it by the data volume of one of the input data sets to obtain a skew threshold, a value whose occurrence frequency in the input data set exceeds the skew threshold being called a skew value; checking whether a skew value exists in the input data set with the larger total data volume; if not, performing hash distribution on each of the two input data sets; if so, detecting all skew values of that input data set in a certain field to obtain a skew value list, then splitting each input data set, based on the skew value list, into a Normal data set without skew values and a Skew data set with skew values, performing hash distribution on the Normal data sets, and performing average distribution or mirror distribution on the Skew data sets; and having the nodes in the cluster perform Hash Join calculations on the data they hold. The invention can improve the execution efficiency of Hash Join.
Description
Technical Field
The invention relates to the technical field of distributed databases, in particular to a Hash Join execution method for detecting tilt data.
Background
In the field of databases, particularly query-oriented databases, response speed is an important factor in user experience. The rapid growth of the internet now generates very large amounts of data, and distributed techniques are widely adopted in the database field to cope with that volume.
The SQL engine of a database contains a Join operator whose role is to perform join computation, i.e., to find the records of two sets that are equal on certain fields. The operator generally has two input data sets. Each input data set comprises a plurality of records, each record comprises a plurality of fields, and all records of a data set have the same format. The Join operator outputs a data set of records, each formed by combining one record from each of the two input data sets, the combination condition being that the two records have the same value in one or more fields.
There are generally three methods for implementing the Join operator, namely Hash Join, Merge Join and Nested Loop Join; the description here focuses on Hash Join. The naive Hash Join method has two phases: (1) a build phase and (2) a probe phase. Each phase uses one of the input data sets. In the build phase, each record of one input data set is inserted into a hash table, where the key is the value of the field to be joined and the value is the record. In the probe phase, for each record of the other input data set, the hash table is queried for the value of the corresponding field; if it is present, the matching records are combined with the probing record to form result records.
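The two phases described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the record layout (Python dicts) and the name `naive_hash_join` are assumptions made for the example.

```python
from collections import defaultdict

def naive_hash_join(build_rows, probe_rows, key):
    """Join two lists of dict records on the field `key` (hypothetical helper)."""
    # Build phase: the hash table maps a join-field value to the records holding it.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    # Probe phase: look up each probe record and combine it with every match.
    result = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            result.append({**match, **row})
    return result

small = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
big = [{"id": 1, "amount": 10}, {"id": 1, "amount": 20}, {"id": 3, "amount": 30}]
joined = naive_hash_join(small, big, "id")
```

Building on the smaller input keeps the hash table small, which is the usual choice, although the text above does not prescribe which input is used for which phase.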
A distributed environment usually provides many compute nodes, and executing Hash Join on a single node is clearly inappropriate: response time is poor and computing resources are wasted. Most current distributed databases therefore apply the same hash function to the specified fields of all records of the two input data sets, distribute the records across the compute nodes accordingly, and then run a naive Hash Join on each node. This approach is logically simple, is easy to deploy in large database engineering projects, and clearly outperforms a single-node Join, but its performance gain degrades severely when the input data is skewed. A skewed input data set contains a large number of identical values in a certain field. The same hash function yields the same result for identical values, so all of those records are sent to the same compute node. That node ends up with far more input data than the others, which prolongs the execution time of the whole Hash Join. Experts in the field have studied solutions to this problem, the more mature ones being based on statistical information. For example, "Xu Y, Kostamaa P, Zhou X, et al. Handling data skew in parallel joins in shared-nothing systems[C]//Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. 2008" proposes the PRPD algorithm, which gives special treatment to the skewed data of the two input data sets. However, PRPD is motivated mainly by reducing network transfer volume, and it assumes that the skewed values are already known.
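The effect of skew on plain hash distribution can be seen in a small experiment. The sketch below is illustrative only: the key values, the three-node cluster, and the use of Python's built-in `hash` as the distribution function are assumptions.

```python
def hash_partition_sizes(keys, num_nodes):
    """Count how many records each node receives under hash distribution."""
    sizes = [0] * num_nodes
    for k in keys:
        sizes[hash(k) % num_nodes] += 1
    return sizes

# 900 records share the join value 7; another 100 values appear once each.
keys = [7] * 900 + list(range(100, 200))
sizes = hash_partition_sizes(keys, 3)
```

Every record with the value 7 lands on the same node, so one node holds at least 900 of the 1000 records while the other two share the rest.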
To address this, a Hash Join execution method for detecting skewed data is provided for the general data scenario, that is, the case in which it is unknown whether the data is skewed and how skewed it is.
Disclosure of Invention
The invention provides a Hash Join execution method for detecting tilt data, which aims to solve the problem that a distributed Hash Join operator is poor in performance on the tilt data.
The invention discloses a Hash Join execution method for detecting tilt data, which adopts the following technical scheme for solving the technical problems:
a Hash Join execution method for detecting tilt data, the implementation comprising:
before executing Hash Join by using an SQL engine of a database, acquiring two input data sets of the Join operator;
setting a relative skew rate and calculating the product of the relative skew rate and the data volume of one of the input data sets to obtain a skew threshold, wherein a value whose occurrence frequency in the two input data sets exceeds the skew threshold is called a skew value;
for the two input data sets, checking whether a skew value exists in the input data set having the larger total data volume;
if no skew value exists, performing hash distribution on each of the two input data sets, so that the two input data sets are split and distributed to the Join node cluster;
if a skew value exists, detecting all skew values of the input data set having the larger total data volume in a certain field to obtain a skew value list; splitting each input data set, based on the skew value list, into a Normal data set without skew values and a Skew data set with skew values; performing hash distribution on the Normal data sets, so that they are split and distributed to the Join node cluster; and performing average distribution or mirror distribution on the Skew data sets, so that each Skew data set is either split and evenly distributed, or copied, to all nodes of the Join node cluster;
nodes in the Join node cluster perform Hash Join calculation on data on the nodes.
Specifically, of the two input data sets, the one with the larger total data volume is referred to as the Big data set, and the one with the smaller total data volume is referred to as the Small data set.
More specifically, the product of the relative skew rate and the sampled data volume of the Big data set is calculated to obtain the skew threshold, and a value whose occurrence frequency in the Big data set and the Small data set exceeds the skew threshold is called a skew value;
a checker is used to check whether the Big data set has a skew value in a certain field, and if it does, a detector is used to count all skew values of the Big data set in that field to obtain the skew value list.
Preferably, the checker first samples the Big data set sequentially, multiplies the sampled data volume by the relative skew rate to obtain the skew threshold, and then checks whether a skew value exists in the sampled data.
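A checker of this kind might look like the sketch below. It assumes that "sequential sampling" means reading the first `sample_size` join-field values of the Big data set; the function name `has_skew` and its parameters are hypothetical.

```python
from collections import Counter
from itertools import islice

def has_skew(big_keys, relative_rate, sample_size):
    """Checker sketch: sample the head of the Big data set and test for skew values."""
    sample = list(islice(big_keys, sample_size))   # sequential sampling (assumed)
    threshold = relative_rate * len(sample)        # skew threshold
    counts = Counter(sample)
    # A value is a skew value if its frequency in the sample exceeds the threshold.
    return any(freq > threshold for freq in counts.values())

skewed_keys = [1, 2, 1, 3, 1, 4, 1, 5, 1, 6]
uniform_keys = list(range(10))
skewed_found = has_skew(skewed_keys, 0.3, 10)
uniform_found = has_skew(uniform_keys, 0.3, 10)
```

With a relative skew rate of 0.3 and ten sampled values the threshold is 3, so the value 1 (five occurrences) is flagged in the first list and nothing is flagged in the second.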
More specifically, according to the skew value list, all records carrying skew values in the Big data set are split into a Skew 1 data set and the remaining data in the Big data set into a Normal 1 data set; likewise, all records carrying skew values in the Small data set are split into a Skew 2 data set and the remaining data in the Small data set into a Normal 2 data set;
a hash router performs hash distribution on the Normal 1 data set and the Normal 2 data set, so that each is split and distributed to the Join node cluster; an average router performs average distribution on the Skew 1 data set, so that it is split and evenly distributed to all nodes of the Join node cluster; and a mirror router performs mirror distribution on the Skew 2 data set, so that it is copied into multiple copies, one of which is distributed to each node of the Join node cluster.
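The detection and splitting steps can be sketched as follows. The sketch assumes the detector counts value frequencies over the whole Big data set and compares them with the relative skew rate times the Big data set size; the function names and the dict record layout are hypothetical.

```python
from collections import Counter

def detect_skew_values(big_keys, relative_rate):
    """Detector sketch: return the set of skew values of the Big data set."""
    threshold = relative_rate * len(big_keys)
    return {v for v, freq in Counter(big_keys).items() if freq > threshold}

def split_by_skew(rows, key, skew_values):
    """Split records into (Skew, Normal) data sets according to the skew value list."""
    skew, normal = [], []
    for row in rows:
        (skew if row[key] in skew_values else normal).append(row)
    return skew, normal

big = [{"k": 1, "b": i} for i in range(6)] + [{"k": v, "b": v} for v in (2, 3, 4, 5)]
small = [{"k": 1, "s": "x"}, {"k": 2, "s": "y"}]

skew_list = detect_skew_values([r["k"] for r in big], 0.5)
skew1, normal1 = split_by_skew(big, "k", skew_list)     # Skew 1 / Normal 1
skew2, normal2 = split_by_skew(small, "k", skew_list)   # Skew 2 / Normal 2
```

The same skew value list, derived from the Big data set only, is applied to both inputs, so matching records always end up in corresponding data sets.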
Preferably, when the hash router is used to perform hash distribution on the Normal 1 dataset and the Normal 2 dataset, the same hash function is used to perform calculation on some fields in the Normal 1 dataset and the Normal 2 dataset, the Normal 1 dataset and the Normal 2 dataset are split into a plurality of subsets according to the calculation result, and the plurality of subsets of the Normal 1 dataset and the plurality of subsets of the Normal 2 dataset are distributed to corresponding nodes of the Join node cluster respectively.
Preferably, when the average router is used to distribute the Skew 1 data set evenly, the Skew 1 data set is split into a plurality of subsets according to a certain field, in this case, a plurality of pieces of data with the same value on the field are distributed evenly on the plurality of subsets, and then the plurality of subsets are distributed evenly on all nodes of the Join node cluster.
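The three routers can be illustrated with the following sketch. It is an assumption-laden example: the average router is implemented here as a per-value round-robin, which is one way to spread records with the same field value evenly, and all function names are hypothetical.

```python
from collections import defaultdict

def hash_route(rows, key, num_nodes):
    """Hash router: records with equal join values go to the same node."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes

def average_route(rows, key, num_nodes):
    """Average router: spread records sharing a value evenly (round-robin per value)."""
    nodes = [[] for _ in range(num_nodes)]
    next_slot = defaultdict(int)
    for row in rows:
        value = row[key]
        nodes[next_slot[value] % num_nodes].append(row)
        next_slot[value] += 1
    return nodes

def mirror_route(rows, num_nodes):
    """Mirror router: every node receives a full copy."""
    return [list(rows) for _ in range(num_nodes)]

rows = [{"k": 1, "i": i} for i in range(9)]
hashed = [len(n) for n in hash_route(rows, "k", 3)]
averaged = [len(n) for n in average_route(rows, "k", 3)]
mirrored = [len(n) for n in mirror_route(rows, 3)]
```

For nine records that all share one value, hash routing places all nine on one node, average routing places three on each node, and mirror routing gives each node all nine. Mirroring the Skew 2 data set is what keeps the join correct once the matching Skew 1 records have been scattered.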
Preferably, the relative skew rate is a value less than 1. The smaller the relative skew rate, the lower the skew threshold and the more skew values the input data set contains; conversely, the larger the relative skew rate, the fewer skew values the input data set contains.
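The relationship between the rate and the number of skew values can be checked on a toy data set. The values and rates below are invented for illustration, and `skew_values` is a hypothetical helper.

```python
from collections import Counter

def skew_values(keys, relative_rate):
    """Return the values whose frequency exceeds relative_rate * len(keys)."""
    threshold = relative_rate * len(keys)
    return sorted(v for v, freq in Counter(keys).items() if freq > threshold)

keys = ["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 5
low = skew_values(keys, 0.1)    # threshold 10
mid = skew_values(keys, 0.4)    # threshold 40
high = skew_values(keys, 0.6)   # threshold 60
```

With 100 records, a rate of 0.1 flags "a", "b" and "c", a rate of 0.4 flags only "a", and a rate of 0.6 flags nothing.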
Compared with the prior art, the Hash Join execution method for detecting the tilt data has the following beneficial effects that:
the invention can realize data skew detection, and adopts Hash distribution, mirror image distribution and average distribution after splitting the input data set containing skew data to distribute the data to the computing nodes of the Join node cluster, thereby improving the execution efficiency of Hash Join in the distributed database through the balanced distribution of tasks.
Drawings
FIG. 1 is a flowchart illustrating steps S1-S3 according to a first embodiment of the present invention;
FIG. 2 is a flowchart of the step S4 according to the first embodiment of the present invention;
FIG. 3 is a flowchart of the step S5 according to the first embodiment of the present invention;
FIG. 4 is a flowchart of steps S6 to S7 according to the first embodiment of the present invention.
Detailed Description
In order to make the technical solutions, technical problems to be solved, and technical effects of the present invention more clearly apparent, the following description clearly describes the technical solutions of the present invention in combination with specific embodiments.
The first embodiment is as follows:
With reference to FIG. 1 to FIG. 4, this embodiment provides a Hash Join execution method for detecting skewed data, which includes:
Step S1, before executing Hash Join using the SQL engine of a database, acquiring the two input data sets of the Join operator, wherein the input data set with the larger total data volume is called the Big data set and the one with the smaller total data volume is called the Small data set.
Step S2, setting a relative skew rate; based on the total data volume of the Big data set, the product of the relative skew rate and the total data volume is the skew threshold, and a value whose occurrence frequency in the Big data set and the Small data set exceeds the skew threshold is called a skew value.
Step S3, sequentially sampling the Big data set using a checker, multiplying the sampled data volume by the relative skew rate to obtain the skew threshold, and checking whether a skew value exists in the sampled data; if no skew value exists, executing step S4, and if a skew value exists, executing step S5.
Step S4, performing hash distribution on the Big data set and the Small data set using a hash router, so that each is split and distributed to the Join node cluster, and then jumping to step S8.
Step S5, detecting all skew values of the Big data set in a certain field using a detector to obtain a skew value list, and then executing step S6.
Step S6, according to the skew value list, splitting all records carrying skew values in the Big data set into a Skew 1 data set and the remaining data in the Big data set into a Normal 1 data set, and splitting all records carrying skew values in the Small data set into a Skew 2 data set and the remaining data in the Small data set into a Normal 2 data set.
Step S7, performing hash distribution on the Normal 1 data set and the Normal 2 data set using a hash router, so that each is split and distributed to the Join node cluster;
performing average distribution on the Skew 1 data set using an average router, so that the Skew 1 data set is split and evenly distributed to all nodes of the Join node cluster;
performing mirror distribution on the Skew 2 data set using a mirror router, so that the Skew 2 data set is copied into multiple copies, one of which is distributed to each node of the Join node cluster.
Step S8, the nodes in the Join node cluster perform Hash Join calculation on the data they hold.
It should be noted that the relative skew rate is a value smaller than 1. The smaller the relative skew rate, the lower the skew threshold and the more skew values the input data set contains; conversely, the larger the relative skew rate, the fewer skew values the input data set contains.
For convenience, the flow from step S5 to step S7 in this embodiment is referred to as the Detect Join process.
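The whole flow can be simulated in a single process to see that the redistribution preserves the join result. The sketch below makes several simplifying assumptions: records are `(key, payload)` tuples, skew values are detected over the full Big data set rather than a sample, average distribution is a plain round-robin, and nodes are modelled as Python lists. None of the names come from the patent.

```python
from collections import Counter, defaultdict

def local_hash_join(small_rows, big_rows):
    """Naive Hash Join on one node: build on Small, probe with Big."""
    table = defaultdict(list)
    for k, payload in small_rows:
        table[k].append(payload)
    return [(k, s, b) for k, b in big_rows for s in table.get(k, [])]

def detect_join(big, small, num_nodes, relative_rate):
    """Simulated Detect Join; returns (joined rows, Big records per node)."""
    threshold = relative_rate * len(big)
    skew = {k for k, f in Counter(k for k, _ in big).items() if f > threshold}
    node_big = [[] for _ in range(num_nodes)]
    node_small = [[] for _ in range(num_nodes)]
    rr = 0
    for row in big:
        if row[0] in skew:                       # Skew 1: average distribution
            node_big[rr % num_nodes].append(row)
            rr += 1
        else:                                    # Normal 1: hash distribution
            node_big[hash(row[0]) % num_nodes].append(row)
    for row in small:
        if row[0] in skew:                       # Skew 2: mirror distribution
            for node in node_small:
                node.append(row)
        else:                                    # Normal 2: hash distribution
            node_small[hash(row[0]) % num_nodes].append(row)
    result = []
    for n in range(num_nodes):                   # each node joins its own data
        result.extend(local_hash_join(node_small[n], node_big[n]))
    return result, [len(b) for b in node_big]

big = [(7, i) for i in range(90)] + [(k, k) for k in range(100, 110)]
small = [(7, "hot"), (100, "x"), (105, "y"), (999, "z")]
result, loads = detect_join(big, small, 3, 0.5)
reference = local_hash_join(small, big)
```

If no value exceeds the threshold, the skew set is empty and every record takes the hash path, which corresponds to the plain distributed Hash Join branch. In this example the 90 records with key 7 are spread 30 per node instead of landing on a single node, and the combined output matches a single-node join.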
To analyze this embodiment, the distributed Hash Join computation widely used by current distributed databases is compared theoretically with the method of this embodiment. Assume the Big data set contains $m$ records and the Small data set contains $n$ records; the time complexity of a single-node Hash Join is then $O(m+n)$. For the distributed Hash Join operator with $k$ compute nodes, let the Big data set sizes on the $k$ nodes be $m_1, m_2, \ldots, m_k$ and the Small data set sizes be $n_1, n_2, \ldots, n_k$; the time complexity is then $O(\max_i m_i + \max_i n_i)$. For Detect Join, the skew values in the Big data set are spread evenly over the $k$ compute nodes, so the whole Big data set is approximately evenly distributed over the $k$ compute nodes and the Big-side term of the time complexity falls to about $m/k$. Since $m/k \le \max_i m_i$, the Detect Join of this embodiment performs better than the distributed Hash Join operator when the data is skewed, and the larger $\max_i m_i$ is, the larger the performance improvement.
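The comparison rests on the fact that a maximum is never smaller than a mean. A short derivation, using the same symbols $m$, $k$ and $m_i$, with a purely hypothetical numeric instance:

```latex
% The per-node Big sizes sum to the total size m.
\sum_{i=1}^{k} m_i = m
\quad\Longrightarrow\quad
\max_{1 \le i \le k} m_i \;\ge\; \frac{1}{k}\sum_{i=1}^{k} m_i \;=\; \frac{m}{k}.

% Hypothetical instance: m = 900, k = 3, hash distribution gives (700, 100, 100).
\max_i m_i = 700
\qquad\text{versus}\qquad
\frac{m}{k} = 300.
```

Equality holds only when hash distribution is already perfectly balanced, which is the case in which skew handling brings no benefit.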
Following the theoretical analysis, the distributed Hash Join computation widely adopted by current distributed databases and the method of this embodiment were compared experimentally on a Join node cluster of three compute nodes.
The skew degree is denoted by p, where p = (number of records carrying skew values in a data set) / (total number of records in that data set); p1 denotes the skew degree of the Small data set and p2 that of the Big data set. All values in the tables below are in ms.
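Read this way, the skew degree is a simple ratio. The sketch below assumes the skew value list is already known; `skew_degree` and the sample keys are hypothetical.

```python
def skew_degree(keys, skew_values):
    """Fraction of records whose join-field value is a skew value."""
    if not keys:
        return 0.0
    return sum(1 for k in keys if k in skew_values) / len(keys)

big_keys = [7] * 6 + [1, 2, 3, 4]
p2 = skew_degree(big_keys, {7})
```

Here six of the ten Big records carry the skew value 7, so p2 is 0.6.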
(I) Suppose the total data volume of the Big data set is three times that of the Small data set.
When p1 = 0 and p1 < p2, the experimental results are shown in Table 1:
Method (time in ms) | p2=0 | p2=0.2 | p2=0.4 | p2=0.6 | p2=0.8 | p2=1 |
---|---|---|---|---|---|---|
Current Hash Join | 4061 | 4022 | 3956 | 4098 | 4252 | 4347 |
Detect Join | 4144 | 3780 | 3734 | 3457 | 3446 | 3434 |
When p1 = p2 = p, the experimental results are shown in Table 2:
Method (time in ms) | p=0 | p=0.2 | p=0.4 | p=0.6 | p=0.8 | p=1 |
---|---|---|---|---|---|---|
Current Hash Join | 4061 | 3788 | 3764 | 3772 | 3673 | 3599 |
Detect Join | 4144 | 3523 | 3500 | 3553 | 3627 | 3641 |
When p2 = 0 and p1 > p2, the experimental results are shown in Table 3:
Method (time in ms) | p1=0 | p1=0.2 | p1=0.4 | p1=0.6 | p1=0.8 | p1=1 |
---|---|---|---|---|---|---|
Current Hash Join | 4061 | 3772 | 3718 | 3583 | 3546 | 3393 |
Detect Join | 4144 | 3865 | 3818 | 3473 | 3325 | 3192 |
As can be seen from Tables 1 to 3, when p2 is greater than 0 the method of this embodiment outperforms the distributed Hash Join widely adopted in current distributed databases in nearly every case; the one exception is p = 1 in Table 2 (3641 ms versus 3599 ms).
(II) Suppose the total data volume of the Small data set is 0.8 times that of the Big data set.
When p1 = 0 and p1 < p2, the experimental results are shown in Table 4:
Method (time in ms) | p2=0 | p2=0.2 | p2=0.4 | p2=0.6 | p2=0.8 | p2=1 |
---|---|---|---|---|---|---|
Current Hash Join | 7112 | 7226 | 7887 | 7978 | 8233 | 8050 |
Detect Join | 7527 | 7782 | 7347 | 6279 | 6164 | 6212 |
When p1 = p2 = p, the experimental results are shown in Table 5:
Method (time in ms) | p=0 | p=0.2 | p=0.4 | p=0.6 | p=0.8 | p=1 |
---|---|---|---|---|---|---|
Current Hash Join | 7112 | 7750 | 7064 | 6660 | 5524 | 4682 |
Detect Join | 7527 | 6971 | 6434 | 5752 | 4806 | 4756 |
When p2 = 0 and p1 > p2, the experimental results are shown in Table 6:
Method (time in ms) | p1=0 | p1=0.2 | p1=0.4 | p1=0.6 | p1=0.8 | p1=1 |
---|---|---|---|---|---|---|
Current Hash Join | 7112 | 7349 | 7078 | 6273 | 5615 | 5157 |
Detect Join | 7527 | 7362 | 6361 | 5927 | 5346 | 5325 |
As Tables 4 to 6 show, when p2 > 0 the Detect Join of this embodiment outperforms the distributed Hash Join widely adopted by current distributed databases in most cases; the exceptions are p2 = 0.2 in Table 4 (7782 ms versus 7226 ms) and p = 1 in Table 5 (4756 ms versus 4682 ms).
Combining the above experimental data: in a general data scenario, that is, when it is unknown whether the data is skewed and how skewed it is, the method of this embodiment performs skew detection on the input data and then executes either step S4 or steps S5 to S7, followed by step S8, according to whether the input data contains skewed data, thereby improving the execution efficiency of Hash Join in the distributed database.
In summary, the Hash Join execution method for detecting skewed data according to the present invention can perform skew detection on data, and can also adopt different distribution means for an input data set containing skewed data to distribute data to computing nodes of a Join node cluster, thereby improving the execution efficiency of Hash Join in a distributed database.
The principles and embodiments of the present invention have been described in detail using specific examples, which are provided only to aid in understanding the core technical content of the present invention. Any improvements and modifications made by those skilled in the art on the basis of the above embodiments, without departing from the principle of the present invention, shall fall within the protection scope of the present invention.
Claims (8)
1. A Hash Join execution method for detecting tilt data, the method comprising:
before executing Hash Join by using an SQL engine of a database, acquiring two input data sets of the Join operator;
setting a relative skew rate and calculating the product of the relative skew rate and the data volume of one of the input data sets to obtain a skew threshold, wherein a value whose occurrence frequency in the two input data sets exceeds the skew threshold is called a skew value;
for the two input data sets, checking whether a skew value exists in the input data set having the larger total data volume;
if no skew value exists, performing hash distribution on each of the two input data sets, so that the two input data sets are split and distributed to the Join node cluster;
if a skew value exists, detecting all skew values of the input data set having the larger total data volume in a certain field to obtain a skew value list; splitting each input data set, based on the skew value list, into a Normal data set without skew values and a Skew data set with skew values; performing hash distribution on the Normal data sets, so that they are split and distributed to the Join node cluster; and performing average distribution or mirror distribution on the Skew data sets, so that each Skew data set is either split and evenly distributed, or copied, to all nodes of the Join node cluster;
nodes in the Join node cluster perform Hash Join calculation on data on the nodes.
2. The Hash Join execution method for detecting skewed data according to claim 1, wherein, of the two input data sets, the input data set with the larger total data volume is called the Big data set, and the data set with the smaller total data volume is called the Small data set.
3. The Hash Join execution method for detecting skewed data according to claim 2, wherein the product of the relative skew rate and the sampled data volume of the Big data set is calculated to obtain the skew threshold, and a value whose occurrence frequency in the Big data set and the Small data set exceeds the skew threshold is called a skew value;
a checker is used to check whether the Big data set has a skew value in a certain field, and if it does, a detector is used to count all skew values of the Big data set in that field to obtain a skew value list.
4. The Hash Join execution method for detecting skewed data according to claim 3, wherein the checker first samples the Big data set sequentially, multiplies the sampled data volume by the relative skew rate to obtain the skew threshold, and then checks whether a skew value exists in the sampled data.
5. The Hash Join execution method for detecting skewed data according to claim 3, wherein, according to the skew value list, all records carrying skew values in the Big data set are split into a Skew 1 data set and the remaining data in the Big data set into a Normal 1 data set, and all records carrying skew values in the Small data set are split into a Skew 2 data set and the remaining data in the Small data set into a Normal 2 data set;
a hash router performs hash distribution on the Normal 1 data set and the Normal 2 data set, so that each is split and distributed to the Join node cluster; an average router performs average distribution on the Skew 1 data set, so that it is split and evenly distributed to all nodes of the Join node cluster; and a mirror router performs mirror distribution on the Skew 2 data set, so that it is copied into multiple copies, one of which is distributed to each node of the Join node cluster.
6. The Hash Join execution method for detecting skewed data according to claim 5, wherein when a Hash router is used to Hash and distribute the Normal 1 dataset and the Normal 2 dataset, the same Hash function is used to perform computation on some fields in the Normal 1 dataset and the Normal 2 dataset, the Normal 1 dataset and the Normal 2 dataset are respectively split into a plurality of subsets according to the computation result, and the plurality of subsets of the Normal 1 dataset and the plurality of subsets of the Normal 2 dataset are respectively distributed to corresponding nodes of the Join node cluster.
7. The method as claimed in claim 6, wherein when average distribution is performed on the Skew 1 data set, the Skew 1 data set is divided into a plurality of subsets according to a field, and the plurality of subsets are then evenly distributed to all nodes of the Join node cluster.
8. The Hash Join execution method for detecting skewed data as claimed in claim 1, wherein the relative skew rate is a value less than 1; the smaller the relative skew rate, the more skew values in the input data set, and conversely, the larger the relative skew rate, the fewer skew values in the input data set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211418045.1A CN115858523A (en) | 2022-11-14 | 2022-11-14 | Hash Join execution method for detecting tilt data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115858523A true CN115858523A (en) | 2023-03-28 |
Family
ID=85663308
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211418045.1A Pending CN115858523A (en) | 2022-11-14 | 2022-11-14 | Hash Join execution method for detecting tilt data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115858523A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090248617A1 (en) * | 2008-03-26 | 2009-10-01 | Stephen Molini | Optimization technique for dealing with data skew on foreign key joins |
US20140108459A1 (en) * | 2012-10-12 | 2014-04-17 | International Business Machines Corporation | Functionality of decomposition data skew in asymmetric massively parallel processing databases |
CN107066612A (en) * | 2017-05-05 | 2017-08-18 | Zhengzhou Yunhai Information Technology Co., Ltd. | A kind of self-adapting data oblique regulating method operated based on SparkJoin |
CN112000467A (en) * | 2020-07-24 | 2020-11-27 | Guangdong Polytechnic Normal University | Data tilt processing method and device, terminal equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Zhou Ya et al.: "CSPRJ: A MapReduce join query algorithm based on data skew", Journal of Chinese Computer Systems, 15 February 2018 (2018-02-15), pages 367 - 371 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||