CN113868230B - Large-scale connection optimization method based on Spark computing framework - Google Patents
Large-scale connection optimization method based on Spark computing framework
- Publication number
- CN113868230B (application CN202111220042.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- spark
- connection
- probability
- rdda
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/217—Database tuning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24558—Binary matching operations
- G06F16/2456—Join operations
Abstract
The invention relates to a large-table connection optimization method based on the Spark computing framework and belongs to the field of big data computing. The method comprises the following steps: S1: clean the data by combining predicate pushdown with a compressed bloom filter, filtering out the large amount of invalid data in each large table and preventing useless data from entering the shuffle stage; S2: build a Spark-based data skew detection model and count the global Map-stage key distribution with a reservoir sampling algorithm; S3: adopt an intermediate data cluster splitting strategy that cuts skewed data clusters to the average rated load capacity, so that high-frequency keys are spread to other, quickly processed partitions and the keys end up approximately uniformly distributed. The invention filters out a large amount of useless data, mitigates data skew, shortens join query time, solves the memory overflow problem of Spark cluster nodes, and improves user satisfaction.
Description
Technical Field
The invention belongs to the field of big data computing and relates to a large-table connection optimization method based on the Spark computing framework. It can be used to mitigate skew in structured data from fields such as smart grids, the Internet, and industry, and to perform large-table joins quickly and accurately.
Background
As the Internet, the Internet of Things, social networks, and the like weave themselves into daily life, we have entered an era of data explosion: data of every kind (for example, from smart grids, the Internet, and industry) grows exponentially, and the enormous data volumes and complex data relationships pose new challenges for data analysis technology. Such mass data is generally stored and computed in tabular form, whether in the Spark big data computing framework or in large databases, and the join is the most frequent and fundamental operation in data processing. In a big data environment the data tables are very large, and in traditional relational databases such as MySQL, Oracle, and DB2, join operations are time-consuming to process; optimizing large-table joins is therefore necessary.
Spark is fully compatible with the Hadoop distributed storage access interface and greatly improves the performance of big data computing systems by processing data in distributed memory. Spark SQL, the Spark module for structured data, greatly reduces the difficulty of data analysis, but once very large tables are joined, a large amount of invalid data enters the shuffle process. In addition, during the shuffle phase of the Spark SQL framework all intermediate data is treated as key/value pairs, and Spark uses a hash algorithm to pull identical keys into the same task. If the data volumes behind different keys differ greatly, with a few keys accounting for an outsized share of the data, data skew occurs: CPU and memory resources cannot be fully utilized while the Spark application runs, the oversized tasks prolong the job, and memory overflow exceptions can even abort it.
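To make the skew mechanism concrete, the following plain-Python sketch shows how hash partitioning sends every record with the same key to the same reduce task, so one hot key overloads a single partition. The key frequencies and partition count are hypothetical; this only illustrates the effect, not the patent's implementation.

```python
# Minimal illustration of hash-partitioning skew. Note: Python salts str
# hashes per process (PYTHONHASHSEED), so which partition gets loaded varies
# between runs, but one partition always receives the hot key's ~90% share.
from collections import Counter

NUM_PARTITIONS = 4

# Hypothetical skewed key frequencies: key "a" dominates the table.
key_counts = {"a": 900_000, "b": 40_000, "c": 30_000, "d": 20_000, "e": 10_000}

partition_load = Counter()
for key, count in key_counts.items():
    # Same rule as Spark's HashPartitioner: partition = hash(key) mod n
    partition_load[hash(key) % NUM_PARTITIONS] += count

for pid in range(NUM_PARTITIONS):
    print(f"partition {pid}: {partition_load[pid]:,} records")
```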
Disclosure of Invention
In view of the above, the present invention aims to provide a large-table connection optimization method based on the Spark computing framework, suitable for structured data from smart grids, the Internet, industry, and the like. It mitigates data skew when two large tables are joined, addressing the high time cost of the join, further solving node memory overflow, shortening join query time, and improving user satisfaction.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A large-table connection optimization method based on the Spark computing framework first cleans the two tables with a predicate pushdown strategy and a compressed bloom filter, filtering out the large amount of useless data in both tables. After cleaning, a data skew detection model is built; once skew is detected, the intermediate data clusters are split so that the data volume on each reduce node is balanced. This mitigates the skew and resolves the memory overflow and long running times that otherwise occur when Spark nodes join large tables.
The method specifically comprises the following steps:
s1: performing data cleaning by combining predicate pushing with a compressed bloom filter, filtering out a large amount of invalid data in a large table, and avoiding a large amount of useless data from entering a shuffle stage;
s2: building a Spark-based data inclination detection model, and counting the Key value distribution in the global Map stage through a reservoir sampling algorithm;
s3: and cutting the inclined data clusters according to the average load rated capacity by adopting an intermediate data cluster segmentation strategy, so that keys with high occurrence frequency enter other partitions with rapid processing, and the keys are in a uniform distribution state.
Further, step S1 specifically comprises: first, the filter expressions of the SQL statement are pushed down to the storage layer to filter data directly, which reduces the amount of data passed to computation and avoids the I/O of scanning the remaining columns of rows that fail the filter; then the compressed Bloom Filter performs hash mapping to find the join attribute values common to the two tables and stores them in new bit arrays A and B, which are broadcast over the network so that the remaining invalid data that takes no part in the join stage can be removed.
Further, step S1 specifically includes the following steps (a code sketch follows the list):
S11: after the Spark calculation engine analyzes filters which can be pushed down, the Spark transmits the filters to Parquet for merging operation, and the merging operation is put on leaf nodes to enable the filters to be executed on a data source, so that filtering of invalid data is truly executed in the data source;
S12: putting the connection attribute of the left table and the right table after the execution of the step S11 into two new RDDs, and marking the connection attribute as RDDA and RDDB;
S13: in each node of the Spark cluster, respectively and sequentially reading the attribute connection values of RDDA and RDDB, adopting n groups of hash functions to calculate the attribute connection values in RDDA and RDDB, and then putting the calculated values into a bloom compression filter to generate a bit array until the connection attributes in all the nodes are calculated;
S14: performing OR operation on the RDDA and RDDB bit arrays in each node processed in the step S13 until the bit arrays of each node are processed to obtain CBFA and CBFB, and broadcasting the CBFA and the CBFB into the Spark cluster, wherein the compressed bloom filter is a data structure which greatly saves space, so that the network consumption is in an acceptance range;
S15: after each node of the Spark cluster receives the broadcast CBFA and CBFB, filtering RDDA by using CBFB, mapping the connection attribute in RDDA to the CBFB bit array by using n sets of hash functions of CBFB, which indicates that the connection attribute is common to RDDA and RDDB, and similarly, filtering RDDB by using CBFA, and performing the next connection operation on the filtered RDDA and RDDB.
Further, step S2 specifically comprises: using a Master-Slaves architecture, each Slave node extracts the key distribution and data through a reservoir sampling algorithm in which every sample has the same probability K/N of being selected, where K is the number of samples drawn and N is the total number of records; from the frequency of each key in the sample, the overall distribution is approximated and it is judged whether the large table's key values are skewed.
Further, step S2 specifically includes the following steps (a code sketch follows the list):
S21: before the map task executes, first draw a uniform random sample of fixed size k from the input data without replacement; the goal of this step is to form the base reservoir;
s22: from the (k+1) th sample, the probability of the sample in the reservoir being replaced by the (k+1) th data=the probability of the (k+1) th sample being selected ×the probability of the (i) th sample being replaced, namely The probability of being retained is
S23: for the jth data, where j > k, the probability that the jth data is selected is k/j; the probability of not being replaced by the j+1th number isWhen running to the nth data, the retained probability=the selected probability =the probability of not being replaced, i.e., the continuous multiplication of conditional probabilities: /(I)The probability of retention for each data is/>
S24: and finally, summarizing the data to the Master node by each Slave node to generate reservoir sampling data, wherein the reservoir sampling data can ensure that the key in the original data is closer to the whole condition.
Further, step S3 specifically comprises: after data skew is detected in step S2, compute the average rated load capacity of a data cluster and set a skew tolerance; when the network transmission time incurred by splitting a data cluster exceeds the node's processing time for it, the cluster is not split; the remaining skewed data clusters are split to the average rated load capacity, keeping the resulting clusters as equal in size as possible.
Further, step S3 specifically includes the following steps (a code sketch follows the list):
S31: denote the data set sampled in step S2 as SC = {SC_i}, where SC_i is the number of key-value pairs in the ith sampled data cluster;
S32: by calculating the standard rated capacity Havg in each bucket, havg is expressed as: Where m is the number of data clusters, h is the number of buckets, and the current remaining capacity of the bucket is represented as { DB 1,DB2,…,DBh };
S33: reverse ordering of SC i, if SC i≥DB1, then a new segment will be split from Havg-sized SC i and loaded into DB 1, with the remainder of SC i-DB1 and the remaining clusters going to the next iterator;
S34: when SC i<DBi, put SC i into DB i, recheck the current second largest SC i-1 for the remaining space to see if it can fill DB i, if SC i+SCi-1≥DBi, SC i-1 would be split and traverse the remaining key pairs forward through all remaining DBs i to see if the remaining key pairs can be fit down;
S35: after each iteration, SC i will be reordered while setting the skew tolerance, and when Havg < SC i is less than or equal to Havg 1.1, SC i is not cut, and because the network overhead time caused by cutting SCi is longer than the data processing time of the current bucket, SCi within the skew tolerance is not processed.
The invention has the following beneficial effects: when two large tables in the Spark computing framework are joined and queried, it filters out a large amount of useless data, mitigates data skew, shortens join query time, solves the memory overflow problem of Spark cluster nodes, and improves user satisfaction.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in the following preferred detail with reference to the accompanying drawings, in which:
FIG. 1 is an overall flow chart of a large table connection optimization method based on Spark computing framework of the present invention;
FIG. 2 is a diagram of a predicate push strategy in combination with a compressed bloom filter for data filtering according to the present invention;
fig. 3 is a flowchart of building the data skew detection model.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the invention with reference to specific examples. The invention may also be practiced or carried out in other embodiments, and the details of this description may be modified or varied without departing from the spirit and scope of the present invention. It should be noted that the illustrations provided with the following embodiments merely illustrate the basic idea of the invention, and the embodiments and their features may be combined with one another provided they do not conflict.
Referring to figs. 1 to 3, the present invention provides a large-table connection optimization method based on the Spark computing framework, which specifically includes the following steps:
Step 1: the data is cleaned by using an extended Bloom Filter, and a large amount of useless data is filtered; the data here includes structured data such as smart grids, the internet, or industry.
As shown in FIG. 2, the predicate pushdown strategy first performs a primary filtering that discards the data of tables A and B that cannot satisfy the join condition, and two new RDDs are generated holding the join attributes of tables A and B respectively; the compressed bloom filter then applies its hash mapping to RDDA and RDDB, yielding one bit array per node; finally the bit arrays are ORed together (as sketched below) to produce the new, filtered tables A and B.
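A small sketch of the OR-merge step, under the assumption that each node's filter is a plain byte array of equal length (the real filter is compressed; compression is omitted here):

```python
# Bitwise OR of per-node bloom bit arrays into one cluster-wide filter.
from functools import reduce

def or_merge(a: bytes, b: bytes) -> bytes:
    # The merged array answers membership for the union of both key sets.
    return bytes(x | y for x, y in zip(a, b))

# Stand-ins for bit arrays built on three worker nodes (all zeros here).
node_arrays = [bytes(1024) for _ in range(3)]
merged = reduce(or_merge, node_arrays)
# In Spark this combine could be expressed as, e.g.,
#   rdd.mapPartitions(build_bits).treeReduce(or_merge)
# before broadcasting the merged filter.
```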
Step 2: constructing a data inclination detection model, and counting the Key value distribution of a global Map stage in a reservoir sampling mode;
As shown in FIG. 3, the detection model uses a Master-Slave architecture: the Master is deployed on Spark's Driver node and a Slave on each Worker node. Reservoir sampling is performed at a configured ratio with the RDD sample operator; as sampling proceeds, per-key counters are incremented, and the Simon model from long-tail theory is used to judge whether the key distribution histogram has reached a steady state, at which point sampling ends. Finally each Slave node returns its key distribution histogram to the Master node, producing the global key data skew detection model.
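The steady-state test below is a stand-in: the patent judges convergence with the Simon model from long-tail theory, whose details are not given here, so this sketch substitutes a simple assumed criterion (stop when the normalized key histogram drifts less than a threshold between sampling rounds):

```python
# Slave-side sampling loop sketch for FIG. 3 (assumed convergence criterion).
from collections import Counter

EPS = 0.01  # drift threshold between rounds (assumed)

def sample_until_stable(batches):
    counts, prev = Counter(), {}
    for batch in batches:        # each batch: keys drawn by reservoir sampling
        counts.update(batch)     # the ever-growing per-key counters
        total = sum(counts.values())
        hist = {k: c / total for k, c in counts.items()}
        drift = sum(abs(hist.get(k, 0.0) - prev.get(k, 0.0))
                    for k in set(hist) | set(prev))
        if prev and drift < EPS:
            break                # histogram is steady; return sample to Master
        prev = hist
    return counts
```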
Step 3: load balancing strategy of Reduce node aiming at data inclination;
1) Denote the sampled data set as SC = {SC_i}, where SC_i is the number of key-value pairs in the ith sampled data cluster;
2) Compute the standard rated capacity Havg of each bucket, expressed as Havg = (SC_1 + SC_2 + … + SC_m) / h, where m is the number of data clusters and h is the number of buckets; the current remaining capacities of the buckets are denoted {DB_1, DB_2, …, DB_h};
3) Sort the SC_i in descending order; if SC_i ≥ DB_1, a new segment of size Havg is split off SC_i and loaded into DB_1, and the remainder SC_i − DB_1, together with the remaining clusters, goes to the next iteration;
4) When SC_i < DB_i, put SC_i into DB_i and check whether the current second-largest cluster SC_{i−1} can fill the remaining space of DB_i; if SC_i + SC_{i−1} ≥ DB_i, SC_{i−1} is split, and the remaining key-value pairs are walked forward through all remaining DB_i to see whether they fit;
5) After each iteration the SC_i are re-sorted; a skew tolerance is also set, and when Havg < SC_i ≤ Havg × 1.1, SC_i is not cut, because the network overhead of cutting SC_i would exceed the current bucket's data processing time; SC_i within the skew tolerance is therefore left unprocessed.
Finally, it is noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it; although the invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that modifications and equivalent substitutions may be made without departing from the spirit and scope of the invention, which is intended to be covered by the claims.
Claims (3)
1. A large-table connection optimization method based on Spark computing framework is characterized by comprising the following steps:
S1: the predicate pushing is utilized to combine with the compressed bloom filter to clean data, a large amount of invalid data in a large table is filtered, and a large amount of useless data is prevented from entering a shuffle stage, and the method specifically comprises the following steps: firstly, pushing down a filtering expression of the SQL expression to a storage layer to directly filter data; then, hash mapping is carried out by utilizing the compressed Bloom Filter, the attribute connection value commonly owned in the two tables is found and stored into a new bit array A and a new bit array B, and the compressed Bloom Filter is utilized to carry out network broadcasting on the bit array A and the bit array B so as to remove other invalid data which do not participate in the connection stage;
s2: building a Spark-based data inclination detection model, and counting the Key value distribution of a global Map stage through a reservoir sampling algorithm, wherein the method specifically comprises the following steps: by adopting a Master-Slaves mode, extracting Key value distribution and data by each Slave node through a reservoir sampling algorithm, wherein the probability of each sample being extracted is K/N, K represents the number of extracted samples, and N represents the total number of samples; calculating the distribution condition approximate to the whole according to the frequency distribution of each Key in the sample, and judging whether the large-table data Key value is inclined or not;
S3: cutting the inclined data cluster according to the average load rated capacity by adopting an intermediate data cluster segmentation strategy, so that keys with high occurrence frequency enter other partitions with rapid processing, and the keys are in a uniform distribution state, and the method specifically comprises the following steps: after detecting data inclination in the step S2, calculating the average load rated capacity of a data cluster, setting inclination tolerance, and not cutting data when the transmission time of the data cluster data cutting network is longer than the node service processing time; cutting the rest inclined data clusters according to the average load rated capacity;
the step S3 specifically comprises the following steps:
S31: denote the data set sampled in step S2 as SC = {SC_i}, where SC_i is the number of key-value pairs in the ith sampled data cluster;
S32: compute the standard rated capacity Havg of each bucket, expressed as Havg = (SC_1 + SC_2 + … + SC_m) / h, where m is the number of data clusters and h is the number of buckets; the current remaining capacities of the buckets are denoted {DB_1, DB_2, …, DB_h};
S33: sort the SC_i in descending order; if SC_i ≥ DB_1, a new segment of size Havg is split off SC_i and loaded into DB_1, and the remainder SC_i − DB_1, together with the remaining clusters, goes to the next iteration;
S34: when SC_i < DB_i, put SC_i into DB_i and check whether the current second-largest cluster SC_{i−1} can fill the remaining space of DB_i; if SC_i + SC_{i−1} ≥ DB_i, SC_{i−1} is split, and the remaining key-value pairs are walked forward through all remaining DB_i to see whether they fit;
S35: after each iteration the SC_i are re-sorted while the skew tolerance is set, and when Havg < SC_i ≤ Havg × 1.1, SC_i is not cut.
2. The large table connection optimization method according to claim 1, wherein the step S1 specifically includes the steps of:
S11: after the Spark calculation engine analyzes filters which can be pushed down, the Spark transmits the filters to Parquet for merging operation, and the merging operation is put on leaf nodes to enable the filters to be executed on a data source, so that filtering of invalid data is truly executed in the data source;
S12: putting the connection attribute of the left table and the right table after the execution of the step S11 into two new RDDs, and marking the connection attribute as RDDA and RDDB;
S13: in each node of the Spark cluster, respectively and sequentially reading the attribute connection values of RDDA and RDDB, adopting n groups of hash functions to calculate the attribute connection values in RDDA and RDDB, and then putting the calculated values into a bloom compression filter to generate a bit array until the connection attributes in all the nodes are calculated;
S14: performing OR operation on the bit arrays RDDA and RDDB in each node processed in the step S13 until the bit arrays of each node are processed to obtain CBFA and CBFB, and broadcasting the CBFA and the CBFB into a Spark cluster;
S15: after each node of the Spark cluster receives the broadcast CBFA and CBFB, filtering RDDA by using CBFB, mapping the connection attribute in RDDA to the CBFB bit array by using n sets of hash functions of CBFB, which indicates that the connection attribute is common to RDDA and RDDB, and similarly, filtering RDDB by using CBFA, and performing the next connection operation on the filtered RDDA and RDDB.
3. The large table connection optimization method according to claim 1, wherein the step S2 specifically includes the steps of:
S21: before performing a map task, first selecting a uniform random sample with a fixed size k from the input data without substitution to form a base reservoir;
s22: from the (k+1) th sample, the probability of the sample in the reservoir being replaced by the (k+1) th data=the probability of the (k+1) th sample being selected ×the probability of the (i) th sample being replaced, namely The probability of being retained is/>
S23: for the jth data, where j > k, the probability that the jth data is selected is k/j; the probability of not being replaced by the j+1th number isWhen running to the nth data, the retained probability=the selected probability =the probability of not being replaced, i.e., the continuous multiplication of conditional probabilities: /(I)The probability of retention for each data is/>
S24: and finally, summarizing the data to the Master node by each Slave node to generate reservoir sampling data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111220042.2A CN113868230B (en) | 2021-10-20 | 2021-10-20 | Large-scale connection optimization method based on Spark computing framework |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111220042.2A CN113868230B (en) | 2021-10-20 | 2021-10-20 | Large-scale connection optimization method based on Spark computing framework |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113868230A CN113868230A (en) | 2021-12-31 |
CN113868230B (en) | 2024-06-04 |
Family
ID=78996709
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111220042.2A Active CN113868230B (en) | 2021-10-20 | 2021-10-20 | Large-scale connection optimization method based on Spark computing framework |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113868230B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116561171B (en) * | 2023-07-10 | 2023-09-15 | 浙江邦盛科技股份有限公司 | Method, device, equipment and medium for processing dual-time-sequence distribution of inclination data |
CN116719846A (en) * | 2023-08-07 | 2023-09-08 | 北京滴普科技有限公司 | Distributed computing engine data query optimization method, device and storage medium |
CN117633024B (en) * | 2024-01-23 | 2024-04-23 | 天津南大通用数据技术股份有限公司 | Database optimization method based on preprocessing optimization join |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106598729A (en) * | 2016-11-18 | 2017-04-26 | 深圳市证通电子股份有限公司 | Data distribution method and system of distributed parallel computing system |
CN108572873A (en) * | 2018-04-24 | 2018-09-25 | 中国科学院重庆绿色智能技术研究院 | A kind of load-balancing method and device solving the problems, such as Spark data skews |
CN108628889A (en) * | 2017-03-21 | 2018-10-09 | 北京京东尚科信息技术有限公司 | Sampling of data mthods, systems and devices based on timeslice |
CN108763489A (en) * | 2018-05-28 | 2018-11-06 | 东南大学 | A method of optimization Spark SQL execute workflow |
CN110659304A (en) * | 2019-09-09 | 2020-01-07 | 杭州中科先进技术研究院有限公司 | Multi-path data stream connection system based on data inclination |
CN112000467A (en) * | 2020-07-24 | 2020-11-27 | 广东技术师范大学 | Data tilt processing method and device, terminal equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109299131B (en) * | 2018-11-14 | 2020-05-29 | 百度在线网络技术(北京)有限公司 | Spark query method and system supporting trusted computing |
- 2021-10-20: application CN202111220042.2A filed in China; granted as patent CN113868230B (status: active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106598729A (en) * | 2016-11-18 | 2017-04-26 | 深圳市证通电子股份有限公司 | Data distribution method and system of distributed parallel computing system |
CN108628889A (en) * | 2017-03-21 | 2018-10-09 | 北京京东尚科信息技术有限公司 | Sampling of data mthods, systems and devices based on timeslice |
CN108572873A (en) * | 2018-04-24 | 2018-09-25 | 中国科学院重庆绿色智能技术研究院 | A kind of load-balancing method and device solving the problems, such as Spark data skews |
CN108763489A (en) * | 2018-05-28 | 2018-11-06 | 东南大学 | A method of optimization Spark SQL execute workflow |
CN110659304A (en) * | 2019-09-09 | 2020-01-07 | 杭州中科先进技术研究院有限公司 | Multi-path data stream connection system based on data inclination |
CN112000467A (en) * | 2020-07-24 | 2020-11-27 | 广东技术师范大学 | Data tilt processing method and device, terminal equipment and storage medium |
Non-Patent Citations (3)
Title |
---|
Spatial data management in Apache Spark: the GeoSpark perspective and beyond. Springer, 2018, vol. 23, pp. 37-78. *
Research on optimization strategies for structured-data join queries based on Spark; Bin Maoli; Master's thesis, Chongqing University of Posts and Telecommunications; 2024-04-18; pp. 1-81 *
Research on approximate query processing technology based on multidimensional analysis of big data; Xie Jinxing; China Masters' Theses Full-text Database, Information Science and Technology; 2018-03-15 (No. 03); I138-1242 *
Also Published As
Publication number | Publication date |
---|---|
CN113868230A (en) | 2021-12-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113868230B (en) | Large-scale connection optimization method based on Spark computing framework | |
Hilprecht et al. | Deepdb: Learn from data, not from queries! | |
CN109669934B (en) | Data warehouse system suitable for electric power customer service and construction method thereof | |
US11003649B2 (en) | Index establishment method and device | |
CN109241093B (en) | Data query method, related device and database system | |
CN110837585B (en) | Multi-source heterogeneous data association query method and system | |
US9135301B2 (en) | Pushdown of sorting and set operations (union, intersection, minus) to a large number of low-power cores in a heterogeneous system | |
CN102521406B (en) | Distributed query method and system for complex task of querying massive structured data | |
US8935233B2 (en) | Approximate index in relational databases | |
CN111259933B (en) | High-dimensional characteristic data classification method and system based on distributed parallel decision tree | |
US8037057B2 (en) | Multi-column statistics usage within index selection tools | |
CN110389950B (en) | Rapid running big data cleaning method | |
CN105117488B (en) | A kind of distributed storage RDF data balanced division method based on hybrid hierarchy cluster | |
CN110909111A (en) | Distributed storage and indexing method based on knowledge graph RDF data characteristics | |
US9110949B2 (en) | Generating estimates for query optimization | |
WO2020211466A1 (en) | Non-redundant gene clustering method and system, and electronic device | |
CN109325062B (en) | Data dependency mining method and system based on distributed computation | |
CN108073641B (en) | Method and device for querying data table | |
WO2019184325A1 (en) | Community division quality evaluation method and system based on average mutual information | |
CN112148830A (en) | Semantic data storage and retrieval method and device based on maximum area grid | |
CN110597929A (en) | Parallel data cube construction method based on MapReduce | |
CN116701351A (en) | Function dependence approximation discovery method suitable for big data | |
WO2024016569A1 (en) | Index recommendation method and apparatus based on data feature | |
Svynchuk et al. | Modification of Query Processing Methods in Distributed Databases Using Fractal Trees. | |
Behr et al. | Learn What Really Matters: A Learning-to-Rank Approach for ML-based Query Optimization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |